The (relevance) weighted likelihood was introduced to formally 
embrace a variety of statistical procedures that trade bias for pre- 
cision. Unlike its classical counterpart, the weighted likelihood com- 
bines all relevant information while inheriting many of its desirable 
features including good asymptotic properties. However, in order to 
be effective, the weights involved in its construction need to be judi- 
ciously chosen. Choosing those weights is the subject of this article 
in which we demonstrate the use of cross-validation. We prove the 
resulting weighted likelihood estimator (WLE) to be weakly consis- 
tent and asymptotically normal. An application to disease mapping 
data is demonstrated. 

1. Introduction. The weighted likelihood (WL for short) has been de- 
veloped for a variety of purposes. Moreover, it shares its underlying purpose 
with other methods such as weighted least squares and kernel smoothers 
which can reduce an estimator's variance while increasing its bias to re- 
duce mean-squared error (MSE), that is, increase its precision. However, 
the achievement of these gains depends on choosing the weights well, which 
is the subject of this article. More specifically, we show that they may be 
data dependent (i.e., "adaptive") and chosen by cross-validation. The idea of 
data-dependent weights goes back at least to the celebrated James-Stein es- 
timator, a WL estimator with adaptive weights that does successfully trade 
bias for variance [Hu and Zidek (2002)]. 

To describe the WL, we assume independent random response vectors
$X_1, \ldots, X_m$ with probability density functions
$f_1(\cdot; \theta_1), \ldots, f_m(\cdot; \theta_m)$, where
$X_i = (X_{i1}, \ldots, X_{in_i})^t$. Further suppose that only population 1,
in particular $\theta_1$, an unknown vector of parameters, is of inferential
interest. Given data $X = x$, the classical likelihood would be

\[ L_1(x_1, \theta_1) = \prod_{j=1}^{n_1} f_1(x_{1j}; \theta_1). \]

When the remaining parameters $\theta_2, \ldots, \theta_m$ are thought to
resemble $\theta_1$, the WL is defined as

\[ \mathrm{WL}(x; \theta_1) = \prod_{i=1}^{m} \prod_{j=1}^{n_i} f_1(x_{ij}; \theta_1)^{\lambda_i}, \]

where $\lambda = (\lambda_1, \ldots, \lambda_m)$, the "weights vector," must be
specified. Notice that the parameters from the remaining populations,
$\theta_2, \ldots, \theta_m$, unlike the data they generate, do not appear in
the WL, since inferential interest focuses on $\theta_1$. It follows that

\[ \log \mathrm{WL}(x; \theta_1) = \sum_{i=1}^{m} \sum_{j=1}^{n_i} \lambda_i \log f_1(x_{ij}; \theta_1). \]

The WL extends the local likelihood method of Tibshirani and Hastie (1987) 
for nonparametric regression, although the idea predates them [see Hu and 
Zidek (2002) for a review]. Following Hu (1997), Hu and Zidek (1995, 2001, 
2002) extend the local likelihood to a more general setting. However, the aim 
is the same. Their method also combines all relevant information in samples 
from populations thought to resemble the one of inferential interest. 
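The maximum WL estimator (WLE) for $\theta_1$, say $\tilde\theta_1$, is defined by

\[ \tilde\theta_1 = \arg\sup_{\theta_1 \in \Theta} \mathrm{WL}(x; \theta_1). \]

In many cases the WLE may be obtained by solving the estimating equation:

\[ (\partial/\partial\theta_1) \log \mathrm{WL}(x; \theta_1) = 0. \]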

Note that uniqueness of the WLE is not assumed. 
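
As a minimal sketch of these definitions, assume every density is Poisson
with a scalar mean (the model used later in Section 5). The function names
`log_wl_poisson` and `wle_poisson` are illustrative and are not taken from
the article. For this model the estimating equation has an explicit root,
which reduces to a weighted average of the sample means when the sample
sizes are equal and the weights sum to one.

```python
import math

def log_wl_poisson(theta, samples, weights):
    """Weighted log-likelihood: sum_i lambda_i * sum_j log f_1(x_ij; theta)."""
    total = 0.0
    for lam, xs in zip(weights, samples):
        for x in xs:
            # log of the Poisson probability mass at x with mean theta
            total += lam * (x * math.log(theta) - theta - math.lgamma(x + 1))
    return total

def wle_poisson(samples, weights):
    """Root of (d/d theta) log WL = 0 for a scalar Poisson mean."""
    num = sum(lam * sum(xs) for lam, xs in zip(weights, samples))
    den = sum(lam * len(xs) for lam, xs in zip(weights, samples))
    return num / den

samples = [[2, 3, 4], [5, 6, 7]]   # sample 1 is the population of interest
weights = [0.8, 0.2]               # hypothetical fixed weights summing to 1
theta_tilde = wle_poisson(samples, weights)
```

With these toy inputs the root coincides with 0.8 times the first sample
mean plus 0.2 times the second.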

Like the MLE, the WLE has a number of desirable properties [Hu and 
Zidek (2002)], in particular consistency and asymptotic normality under 
reasonable general conditions [Hu (1997) and Wang, van Eeden and Zidek 
(2004)]. However, these asymptotic properties have only been shown with 
fixed weights and hence need to be extended in this article to cover the 
estimators we obtain using cross-validation. 

In its most primitive but nevertheless useful form, the cross-validation 
procedure consists of controlled and uncontrolled division of the data sample 
into two subsamples. For example, a subsample can be generated by delet- 
ing one or more observations or it can be a random sample from the data 
set. Stone (1974) began the systematic study of cross-validatory choice and 
assessment in statistical prediction. Both Stone (1974) and Geisser (1975) 
discuss its application to the K-group problem and use a linear combination 
of the sample means from different groups to estimate a common mean. 
Breiman and Friedman (1997) also demonstrate the benefit of using cross- 
validation to obtain linear combinations of predictors that perform well in 
multivariate regression. 

The article is organized as follows. The adaptive weights are derived in 
Section 2. The asymptotic properties of the resulting WLE are presented 
in Section 3. Results of simulation studies are discussed in Section 4. In 
Section 5 an application to disease mapping data demonstrates the benefits 
of using the proposed method in conjunction with the WLE when compared 
with traditional estimators. 

2. Choosing adaptive weights. For cross-validation there are many ways 
of dividing the entire sample into subsets, such as a random selection tech- 
nique. However, we use the simplest leave-one-out approach in this arti- 
cle since the analytic forms of the optimum weights are then completely 
tractable for the linear WLE. Denote the vector of parameters and the
weight vector by $\theta = (\theta_1, \theta_2, \ldots, \theta_m, \rho)$ and
$\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_m)$, respectively. Let
$\lambda_e^{\mathrm{opt}}$ and $\lambda_u^{\mathrm{opt}}$ denote the optimum
weight vectors to be defined in the sequel for samples with equal and unequal
sizes, respectively. We require that $\sum_{i=1}^{m} \lambda_i = 1$ in this
section and throughout this article.

Suppose that we have m populations which might be related to each other. 
The probability density functions or probability mass functions are of the 
form $f_i(x; \theta_i)$ with $\theta_i$ as the parameter vector for
population $i$. Assume that

\[
\begin{aligned}
X_{11}, X_{12}, X_{13}, \ldots, X_{1n_1} &\stackrel{\mathrm{i.i.d.}}{\sim} f_1(x; \theta_1), \\
X_{21}, X_{22}, X_{23}, \ldots, X_{2n_2} &\stackrel{\mathrm{i.i.d.}}{\sim} f_2(x; \theta_2), \\
&\;\;\vdots \\
X_{m1}, X_{m2}, X_{m3}, \ldots, X_{mn_m} &\stackrel{\mathrm{i.i.d.}}{\sim} f_m(x; \theta_m),
\end{aligned}
\]

where, for fixed $i$, the $\{X_{ij}\}$ are observations obtained from
population $i$ and so on. Assume that observations obtained within each
population are independent and identically distributed. Also observations
from one population are independent of those from other populations except
that $\mathrm{Corr}(X_{ij}, X_{kj}) = \rho$, for any fixed $j$ and $i \ne k$.
That is, observations having the same second subscripts are not necessarily
independent even though they are from different populations. This would allow
a spatial correlation structure but not a temporal one. We also assume that
$E(X_{ij}) = \phi(\theta_i) = \phi_i$, say, for $j = 1, 2, \ldots, n_i$,
$i = 1, 2, \ldots, m$. The population parameter of the first population,
$\theta_1$, is of inferential interest.

Our cross-validatory approach of estimating the weights for the WLE 
flows from taking prediction as our inferential objective. In other words, we
seek an estimator $\tilde\theta_1$ of $\theta_1$ that enables us to predict
accurately, in some sense, a randomly drawn element $X_1$ from the first
population. But how should the precision of $\tilde\theta_1$ be assessed?

One answer is the expected log score. Denoting by $E$ the expectation
with respect to the conditional distribution of $X_1$ given $\theta_1$, that
score is $E[\log f_1(X_1 \mid \tilde\theta_1)]$, an index of
$\tilde\theta_1$'s performance.

However, the complexity of that index makes its use impractical in
applications such as that in Section 5. We therefore adopt an approximation
as a compromise. To obtain the approximation, we assume a one-to-one mapping
of $\theta_1$ into $(\phi_1, \tau_1)$ where the range of $\phi_1$ covers that
of $X_1$. In fact, with an abuse of notation we represent $\theta_1$ by
$\theta_1 = (\phi_1, \tau_1)$ and $\tilde\theta_1$ in a similar way. We
further assume that

\[ \frac{\partial \log E[f_1(X_1 \mid \theta_1)]}{\partial \phi_1} = 0 \]

and

\[ \frac{\partial^2 \log E[f_1(X_1 \mid \theta_1)]}{\partial^2 \phi_1} < 0, \]

for all $\theta$ with all higher-order derivatives being assumed to exist.
These assumptions are satisfied for the normal distribution, for example, and
more importantly for our application in Section 5, the Poisson distribution.

Under these assumptions, the first-order term in a three-term Taylor
expansion of the expected log score vanishes. Therefore, ignoring irrelevant
terms and factors, we obtain $(\tilde\phi_1 - \phi_1)^2$ as an approximation
to the negative expected log score as a measure of $\tilde\phi_1$'s
precision. Finally, for its empirical assessment, we estimate the unknown
$\phi_1$ in this measure by $X_1$. Moreover, we adopt that empirical measure
to obtain adaptive weights by cross-validation. To that end, we use $(-j)$ to
indicate that the $j$th item has been dropped from the sample.

Taking the usual path, we predict $X_{1j}$ by $\phi(\tilde\theta_1^{(-j)})$,
the WLE of its mean without using the $X_{1j}$. Note that
$\phi(\tilde\theta_1^{(-j)})$ is a function of the weight vector $\lambda$ by
the construction of the WLE. Based on the log score approximation above, a
natural measure for the discrepancy of the WLE becomes

\[ D(\lambda) = \sum_{j=1}^{n_1} \bigl(X_{1j} - \phi(\tilde\theta_1^{(-j)})\bigr)^2. \tag{1} \]

The optimum weights are derived such that the minimum of $D(\lambda)$ is
achieved for fixed sample sizes $n_1, n_2, \ldots, n_m$ and
$\sum_{i=1}^{m} \lambda_i = 1$.

If the inferential interest is on the means of some commonly used
distributions from the exponential family, such as normal, binomial,
exponential and Poisson distributions, it then follows that
$\phi(\tilde\theta)$ is simply a linear combination of the MLEs for each
population. In this section we will investigate the behavior of the optimum
weights by cross-validation for the linear case since we can derive the
analytical forms of the optimum weights from (1).

2.1. Linear WLEs for equal sample sizes. Stone (1974) and Geisser (1975)
discuss the application of the cross-validation approach to the so-called
K-group problem. Suppose that the data set $S$ consists of $n$ observations
in each of $K$ groups. The mean predictor for the $i$th group is

\[ \hat\mu_i = (1 - \alpha)\bar X_{i\cdot} + \alpha \bar X_{\cdot\cdot}, \]

where $\bar X_{i\cdot} = \frac{1}{n}\sum_{j=1}^{n} X_{ij}$ and
$\bar X_{\cdot\cdot} = \frac{1}{K}\sum_{i=1}^{K} \bar X_{i\cdot}$. If our
interest focuses on group 1, the relevant predictor is

\[ \hat\mu_1 = \Bigl(1 - \frac{K-1}{K}\alpha\Bigr)\bar X_{1\cdot} + \frac{\alpha}{K}\sum_{i=2}^{K} \bar X_{i\cdot}, \]

where $\alpha$ is a parameter. Stone (1974) uses cross-validation to derive
an optimal value for $\alpha$. We remark that the above formula is just a
particular linear combination of the sample means.

We consider more general linear combinations and throughout this section
assume $n_1 = n_2 = \cdots = n_m = n$. Let $\tilde\theta_1^{(e)}$ denote the
WLE obtained through cross-validation. If $\phi(\theta) = \theta$, the linear
WLE for $\theta_1$ is then defined as

\[ \tilde\theta_1^{(e)} = \sum_{i=1}^{m} \lambda_i \bar X_{i\cdot}, \]

where $\sum_{i=1}^{m} \lambda_i = 1$.

In this section we will use cross-validation by simultaneously deleting
$X_{1j}, X_{2j}, \ldots, X_{mj}$ for each fixed $j$. That is, we delete one
data point from each sample at each step. This might be appropriate if these
data points are obtained at the same time point and strong associations exist
among them. By simultaneously deleting $X_{1j}, X_{2j}, \ldots, X_{mj}$ for
each fixed $j$, we might achieve numerical stability of the cross-validation
procedure. An alternative approach is to delete a data point from only the
first sample at each step. That approach will be studied in this section as
well.

Let $\bar X_{i\cdot}^{(-j)}$ be the sample mean of the $i$th sample with the
$j$th element in that sample excluded. A natural measure for the discrepancy
of $\tilde\theta_1^{(e)}$ might be

\[
D_e^{(m)}(\lambda) = \sum_{j=1}^{n} \Bigl( X_{1j} - \sum_{i=1}^{m} \lambda_i \bar X_{i\cdot}^{(-j)} \Bigr)^2
= c(X) - 2\lambda^t b_e(X) + \lambda^t A_e(X) \lambda,
\]

where $c(X) = \sum_{j=1}^{n} X_{1j}^2$,
$(b_e(X))_i = \sum_{j=1}^{n} X_{1j} \bar X_{i\cdot}^{(-j)}$ and
$(A_e(X))_{ik} = \sum_{j=1}^{n} \bar X_{i\cdot}^{(-j)} \bar X_{k\cdot}^{(-j)}$,
$i = 1, 2, \ldots, m$, $k = 1, 2, \ldots, m$. For expository simplicity, let
$b_e = b_e(X)$ and $A_e = A_e(X)$ in this article.

An optimum weight vector by the cross-validation procedure is defined
to be a vector that minimizes the objective function and satisfies
$\sum_{i=1}^{m} \lambda_i = 1$.


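2.1.1. Two-population case. For simplicity, first consider the simple case
of just two populations,

\[
\begin{aligned}
X_{11}, X_{12}, X_{13}, \ldots, X_{1n} &\stackrel{\mathrm{i.i.d.}}{\sim} f_1(x; \theta_1, \sigma_1^2), \\
X_{21}, X_{22}, X_{23}, \ldots, X_{2n} &\stackrel{\mathrm{i.i.d.}}{\sim} f_2(x; \theta_2, \sigma_2^2),
\end{aligned}
\]

with $E(X_{1j}) = \theta_1$, $E(X_{2j}) = \theta_2$,
$\mathrm{Var}(X_{1j}) = \sigma_1^2$ and $\mathrm{Var}(X_{2j}) = \sigma_2^2$.
Furthermore, assume that $\rho = \mathrm{cor}(X_{1j}, X_{2j})$,
$j = 1, 2, \ldots, n$. Denote $\theta^0 = (\theta_1^0, \theta_2^0)$ where
$\theta_1^0$ and $\theta_2^0$ are the true values for $\theta_1$ and
$\theta_2$, respectively.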

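We seek the optimum weights $\lambda_1$ and $\lambda_2$ with
$\lambda_1 + \lambda_2 = 1$ such that they minimize the following objective
function:

\[ D_e^{(2)} = \sum_{j=1}^{n} \bigl( X_{1j} - \lambda_1 \bar X_{1\cdot}^{(-j)} - \lambda_2 \bar X_{2\cdot}^{(-j)} \bigr)^2 - \gamma(\lambda_1 + \lambda_2 - 1). \]

Differentiating $D_e^{(2)}$ with respect to $\lambda_1$ and $\lambda_2$, we
have

\[
\begin{aligned}
\frac{\partial D_e^{(2)}}{\partial \lambda_1} &= -2\sum_{j=1}^{n} \bar X_{1\cdot}^{(-j)} \bigl( X_{1j} - \lambda_1 \bar X_{1\cdot}^{(-j)} - \lambda_2 \bar X_{2\cdot}^{(-j)} \bigr) - \gamma = 0, \\
\frac{\partial D_e^{(2)}}{\partial \lambda_2} &= -2\sum_{j=1}^{n} \bar X_{2\cdot}^{(-j)} \bigl( X_{1j} - \lambda_1 \bar X_{1\cdot}^{(-j)} - \lambda_2 \bar X_{2\cdot}^{(-j)} \bigr) - \gamma = 0.
\end{aligned}
\]

It follows that

\[
\lambda_2^{\mathrm{opt}}(X) = \frac{\sum_{j=1}^{n} \bigl(\bar X_{1\cdot}^{(-j)} - \bar X_{2\cdot}^{(-j)}\bigr)\bigl(\bar X_{1\cdot}^{(-j)} - X_{1j}\bigr)}{\sum_{j=1}^{n} \bigl(\bar X_{1\cdot}^{(-j)} - \bar X_{2\cdot}^{(-j)}\bigr)^2},
\qquad
\lambda_1^{\mathrm{opt}}(X) = 1 - \lambda_2^{\mathrm{opt}}(X).
\]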

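Lemma 2.1. The following identity holds:

\[ \lambda_1^{\mathrm{opt}} = 1 - \lambda_2^{\mathrm{opt}} \quad\text{and}\quad \lambda_2^{\mathrm{opt}} = S_2^e / S_1^e, \]

where

\[
\begin{aligned}
S_1^e &= \frac{n^2(n-2)}{(n-1)^2}(\bar x_{1\cdot} - \bar x_{2\cdot})^2 + \frac{1}{(n-1)^2}\sum_{j=1}^{n} (x_{1j} - x_{2j})^2, \\
S_2^e &= \frac{n^2}{(n-1)^2}\bigl(\hat\sigma_1^2 - \widehat{\mathrm{cov}}\bigr),
\end{aligned}
\]
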

SELECTING LIKELIHOOD WEIGHTS BY CROSS-VALIDATION 






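where $\hat\sigma_1^2 = \frac{1}{n}\sum_{j=1}^{n} (x_{1j} - \bar x_{1\cdot})^2$ and $\widehat{\mathrm{cov}} = \frac{1}{n}\sum_{j=1}^{n} (x_{1j} - \bar x_{1\cdot})(x_{2j} - \bar x_{2\cdot})$.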
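
The closed form in Lemma 2.1 can be checked numerically against the
defining leave-one-column-out sums. The sketch below is illustrative only:
the function names are ours, not the authors', and the closed form is coded
with the $1/n$ normalization of the sample variance and covariance.

```python
def loo_mean(xs, j):
    """Sample mean of xs with its j-th element left out."""
    return (sum(xs) - xs[j]) / (len(xs) - 1)

def lambda2_direct(x1, x2):
    """lambda_2^opt from the leave-one-column-out sums (equal sample sizes)."""
    num = den = 0.0
    for j in range(len(x1)):
        m1, m2 = loo_mean(x1, j), loo_mean(x2, j)
        num += (m1 - m2) * (m1 - x1[j])
        den += (m1 - m2) ** 2
    return num / den

def lambda2_closed_form(x1, x2):
    """lambda_2^opt = S_2^e / S_1^e as in Lemma 2.1."""
    n = len(x1)
    xb1, xb2 = sum(x1) / n, sum(x2) / n
    var1 = sum((a - xb1) ** 2 for a in x1) / n
    cov = sum((a - xb1) * (b - xb2) for a, b in zip(x1, x2)) / n
    s1 = (n ** 2 * (n - 2) / (n - 1) ** 2) * (xb1 - xb2) ** 2 \
        + sum((a - b) ** 2 for a, b in zip(x1, x2)) / (n - 1) ** 2
    s2 = (n ** 2 / (n - 1) ** 2) * (var1 - cov)
    return s2 / s1

x1 = [1.0, 2.0, 3.0]   # toy sample from the population of interest
x2 = [2.0, 2.0, 5.0]   # toy sample from the second population
lam2_direct = lambda2_direct(x1, x2)
lam2_closed = lambda2_closed_form(x1, x2)
```

For this toy data both routes give the same negative weight, which
illustrates that the cross-validated $\lambda_2^{\mathrm{opt}}$ is not
constrained to be nonnegative.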

The value of \^ can be seen as some sort of measure of relevance between 
the two samples. If this "measure" is almost zero, the formula for
$\lambda_2^{\mathrm{opt}}$ will assume a very small value. This implies that
there is no need to combine the two populations if the difference between the
two sample means is relatively large or the second sample has little
relevance to the first one. The weights chosen by the cross-validation
procedure will then guard against the undesirable scenario in which too much
bias might be introduced into the estimation procedure. On the other hand, if
the second sample does contain valuable information about the parameter of
interest, then the cross-validation procedure will recognize that by
assigning a nonzero value to $\lambda_2^{\mathrm{opt}}$. Note that knowledge
of the variances and correlation is not assumed.

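Proposition 2.1. If $\rho < \sigma_1/\sigma_2$, then

\[ P_{\theta^0}\bigl(\lambda_2^{\mathrm{opt}} > 0\bigr) \to 1. \]

We remark that the condition $\rho < \sigma_1/\sigma_2$ is satisfied if
$\sigma_2 < \sigma_1$ or $\rho < 0$. If the condition
$\rho < \sigma_1/\sigma_2$ is not satisfied, then $\lambda_2^{\mathrm{opt}}$
will have a negative sign for sufficiently large $n$. However, the value of
$\lambda_2^{\mathrm{opt}}$ will converge to zero as shown in the next
proposition.

Proposition 2.2. If $\theta_1^0 \ne \theta_2^0$, then, for any given
$\varepsilon > 0$,

\[ P_{\theta^0}\bigl(|\lambda_1^{\mathrm{opt}} - 1| < \varepsilon\bigr) \to 1 \quad\text{and}\quad P_{\theta^0}\bigl(|\lambda_2^{\mathrm{opt}}| < \varepsilon\bigr) \to 1. \]

The asymptotic limit of the weights will not exist if $\theta_1^0$ equals
$\theta_2^0$. The cross-validation procedure will not be able to detect the
difference of the two populations if there is none. This problem can be
solved by defining $\lambda_2^{\mathrm{opt}} = S_2^e/(S_1^e + \delta_e)$,
where $\delta_e$ is a small positive constant.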

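2.1.2. Alternative matrix representation of the optimum weights. In order
to handle more than two populations, it is necessary to derive an alternative
matrix representation of $\lambda^{\mathrm{opt}}$. Define $e_n = n/(n-1)$.
It can be verified that

\[ \bar X_{i\cdot}^{(-j)} = \bar x_{i\cdot} - \frac{1}{n-1}(x_{ij} - \bar x_{i\cdot}). \]

Thus, we have

\[ \sum_{j=1}^{n} \bigl(\bar X_{i\cdot}^{(-j)} - \bar x_{i\cdot}\bigr)\bigl(\bar X_{k\cdot}^{(-j)} - \bar x_{k\cdot}\bigr) = \frac{e_n}{n-1}\,\widehat{\mathrm{cov}}_{ik}, \qquad \widehat{\mathrm{cov}}_{ii} = \hat\sigma_i^2, \]

where

\[
\begin{aligned}
\hat\sigma_i^2 &= \frac{1}{n}\sum_{j=1}^{n} (x_{ij} - \bar x_{i\cdot})^2, \qquad i = 1, 2, \ldots, m, \\
\widehat{\mathrm{cov}}_{ik} &= \frac{1}{n}\sum_{j=1}^{n} (x_{ij} - \bar x_{i\cdot})(x_{kj} - \bar x_{k\cdot}).
\end{aligned}
\]

Recall that, for $1 \le i \le m$ and $1 \le k \le m$,

\[ (A_e)_{ik} = \sum_{j=1}^{n} \bar X_{i\cdot}^{(-j)} \bar X_{k\cdot}^{(-j)}. \]

It follows that

\[ A_e = \frac{e_n}{n-1}\hat\Sigma + \Bigl( e_n^2 (n-2) + \frac{e_n}{n-1} \Bigr)\hat\theta\hat\theta^t, \tag{5} \]

where $\hat\Sigma_{ik} = \widehat{\mathrm{cov}}_{ik}$ and
$\hat\theta = (\bar x_{1\cdot}, \ldots, \bar x_{m\cdot})^t$.
We also have

\[ (b_e(x))_i = (A_e)_{i1} - \frac{n}{(n-1)^2}\sum_{j=1}^{n} (x_{1j} - \bar x_{1\cdot})(x_{ij} - \bar x_{i\cdot}). \tag{6} \]

It then follows that

\[ b_e(x) = A_1 - e_n^2 \hat\Sigma_1, \tag{7} \]

where $A_1$ is the first column of $A_e$ and $\hat\Sigma_1$ is the first
column of the sample covariance matrix $\hat\Sigma$. We are now in a position
to derive the optimum weights in matrix form when the sample sizes are equal.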

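Proposition 2.3. The optimum weight vector which minimizes $D_e^{(m)}$
takes the form

\[ \lambda_e^{\mathrm{opt}} = (1, 0, 0, \ldots, 0)^t - e_n^2 \Bigl( A_e^{-1}\hat\Sigma_1 - \frac{\mathbf{1}^t A_e^{-1}\hat\Sigma_1}{\mathbf{1}^t A_e^{-1}\mathbf{1}} A_e^{-1}\mathbf{1} \Bigr). \]

We remark that $A_e$ is invertible since $\hat\Sigma$ is invertible. Note
that the expression of the weight vector in the two-population case can also
be derived by using the matrix representation given as above.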
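
The matrix form can be sketched in a few lines of plain Python. The code
below is an illustration under stated assumptions rather than the authors'
implementation: it takes $e_n = n/(n-1)$ (the value for which (7) holds
exactly), uses the $1/n$ normalization for the sample covariance, builds
$A_e$ directly from the leave-one-out means, and solves the linear systems
with a small hand-written Gaussian elimination. The names `solve` and
`matrix_form_weights` are hypothetical.

```python
def loo_means(xs):
    """All leave-one-out means of one sample."""
    s, n = sum(xs), len(xs)
    return [(s - x) / (n - 1) for x in xs]

def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, m):
            f = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= f * aug[col][c]
    z = [0.0] * m
    for r in range(m - 1, -1, -1):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, m))
        z[r] = (aug[r][m] - tail) / aug[r][r]
    return z

def matrix_form_weights(samples):
    """Optimum weights for m equal-sized samples; sample 0 is of interest."""
    m, n = len(samples), len(samples[0])
    e_n = n / (n - 1)
    loo = [loo_means(xs) for xs in samples]
    means = [sum(xs) / n for xs in samples]
    # (A_e)_{ik} = sum_j of the products of leave-one-out means
    a_e = [[sum(p * q for p, q in zip(loo[i], loo[k])) for k in range(m)]
           for i in range(m)]
    # first column of the sample covariance matrix (1/n normalization)
    sigma_1 = [sum((a - means[0]) * (b - means[i])
                   for a, b in zip(samples[0], samples[i])) / n
               for i in range(m)]
    u = solve(a_e, sigma_1)        # A_e^{-1} Sigma_1
    v = solve(a_e, [1.0] * m)      # A_e^{-1} 1
    ratio = sum(u) / sum(v)
    lam = [-(e_n ** 2) * (u[i] - ratio * v[i]) for i in range(m)]
    lam[0] += 1.0
    return lam

lam_two = matrix_form_weights([[1.0, 2.0, 3.0], [2.0, 2.0, 5.0]])
lam_three = matrix_form_weights([[1.0, 2.0, 3.0, 4.0, 6.0],
                                 [2.0, 2.0, 5.0, 3.0, 7.0],
                                 [0.0, 4.0, 1.0, 5.0, 2.0]])
```

In the two-sample call the result agrees with the ratio formula of the
two-population case, and by construction the weights always sum to one.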
2.2. Linear WLE for unequal sample sizes. In the previous section we
discussed choosing the optimum weights when the sample sizes are equal.
In this section we propose to use the cross-validation procedure for choosing
adaptive weights for unequal sample sizes. If the sample sizes are not equal,
it is not clear whether the delete-one-column approach is reasonable. For
example, suppose that there are 10 observations in the first sample and
there are 5 observations in the second. Then there is no observation to
delete for the second sample for the second half of the cross-validation steps.
Furthermore, we might lose accuracy in prediction by deleting one entire
column when sample sizes are small. Thus, we propose an alternative method
that deletes only one data point from the first sample and keeps all the data
points from the rest of the samples when the sample sizes are not equal.

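2.2.1. Two-population case. Let us again consider the two populations.
The optimum weights $\lambda_u^{\mathrm{opt}}$ are obtained by minimizing the
objective function

\[ D_u^{(2)}(\lambda) = \sum_{j=1}^{n_1} \bigl( X_{1j} - \lambda_1 \bar X_{1\cdot}^{(-j)} - \lambda_2 \bar X_{2\cdot} \bigr)^2, \]

where $\lambda_1 + \lambda_2 = 1$ and
$\bar X_{1\cdot}^{(-j)} = \frac{1}{n_1 - 1}\sum_{k \ne j} X_{1k}$. We remark
that the major difference between $D_e^{(2)}$ and $D_u^{(2)}$ is that only
the $j$th data point of the first sample is left out for the $j$th term in
$D_u^{(2)}$.

Under the condition that $\lambda_1 + \lambda_2 = 1$, we can rewrite
$D_u^{(2)}$ as a function of $\lambda_1$:

\[
\begin{aligned}
D_u^{(2)} &= \sum_{j=1}^{n_1} \bigl( X_{1j} - \lambda_1 \bar X_{1\cdot}^{(-j)} - (1 - \lambda_1)\bar X_{2\cdot} \bigr)^2 \\
&= \sum_{j=1}^{n_1} \bigl( (X_{1j} - \bar X_{2\cdot}) + \lambda_1 (\bar X_{2\cdot} - \bar X_{1\cdot}^{(-j)}) \bigr)^2.
\end{aligned}
\]

By differentiating $D_u^{(2)}$ with respect to $\lambda_1$, we then have

\[
\lambda_1^{\mathrm{opt}} = \frac{n_1(\bar X_{1\cdot} - \bar X_{2\cdot})^2 - \bigl(n_1/(n_1 - 1)\bigr)\hat\sigma_1^2}{n_1(\bar X_{1\cdot} - \bar X_{2\cdot})^2 + \bigl(n_1/(n_1 - 1)^2\bigr)\hat\sigma_1^2},
\qquad
\lambda_2^{\mathrm{opt}} = 1 - \lambda_1^{\mathrm{opt}}.
\]

The adaptive optimum weights still converge to $(1, 0)$ when the sample sizes
are not equal.

Proposition 2.4. If $\theta_1^0 \ne \theta_2^0$, then
$\lambda_1^{\mathrm{opt}} \xrightarrow{P_{\theta^0}} 1$ and
$\lambda_2^{\mathrm{opt}} \xrightarrow{P_{\theta^0}} 0$.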
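
A short numerical check of the unequal-size formula follows. It is a sketch
with hypothetical function names, assuming
$\hat\sigma_1^2 = \frac{1}{n_1}\sum_j (X_{1j} - \bar X_{1\cdot})^2$; the
direct version minimizes the quadratic in $\lambda_1$ obtained when only the
$j$th point of the first sample is left out.

```python
def lambda1_unequal_direct(x1, x2):
    """Minimizer of D_u^(2) over lambda_1, computed from the defining sums."""
    n1 = len(x1)
    xb2 = sum(x2) / len(x2)
    s1 = sum(x1)
    num = den = 0.0
    for j in range(n1):
        loo1 = (s1 - x1[j]) / (n1 - 1)   # first-sample mean without point j
        a = x1[j] - xb2
        c = xb2 - loo1
        num -= a * c
        den += c * c
    return num / den

def lambda1_unequal_closed(x1, x2):
    """Closed form of the same minimizer."""
    n1 = len(x1)
    xb1, xb2 = sum(x1) / n1, sum(x2) / len(x2)
    var1 = sum((a - xb1) ** 2 for a in x1) / n1
    d2 = (xb1 - xb2) ** 2
    return (n1 * d2 - n1 / (n1 - 1) * var1) / (n1 * d2 + n1 / (n1 - 1) ** 2 * var1)

x1 = [1.0, 2.0, 3.0, 6.0]   # toy first sample, n_1 = 4
x2 = [4.0, 5.0]             # toy second sample, n_2 = 2
lam1_direct = lambda1_unequal_direct(x1, x2)
lam1_closed = lambda1_unequal_closed(x1, x2)
```

Only the mean of the second sample enters, so its size $n_2$ affects the
weight solely through $\bar X_{2\cdot}$.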
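2.2.2. Optimum weights by cross-validation. We now derive the matrix
representation for the optimum weights by cross-validation when the sample
sizes are not all equal. The objective function is defined as follows:

\[
D_u^{(m)} = \sum_{j=1}^{n_1} \Bigl( X_{1j} - \lambda_1 \bar X_{1\cdot}^{(-j)} - \sum_{i=2}^{m} \lambda_i \bar X_{i\cdot} \Bigr)^2
= c(X) - 2 b(X)^t \lambda_u + \lambda_u^t A(X) \lambda_u,
\]

where

\[
\begin{aligned}
b_1 &= \frac{n_1^2}{n_1 - 1}\bar X_{1\cdot}^2 - \frac{1}{n_1 - 1}\sum_{j=1}^{n_1} X_{1j}^2, \\
b_i &= n_1 \bar X_{1\cdot} \bar X_{i\cdot}, \qquad i = 2, \ldots, m,
\end{aligned}
\]

and

\[ a_{ij} = n_1 \bar X_{i\cdot} \bar X_{j\cdot}, \qquad i \ne 1 \text{ or } j \ne 1. \]

It then follows that

\[ A = n_1 (\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_m)^t (\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_m) + D, \]

where $\hat\theta_i = \bar X_{i\cdot}$,

\[ d_{11} = \frac{n_1}{(n_1 - 1)^2}\hat\sigma_1^2, \qquad d_{ij} = 0, \quad i \ne 1 \text{ or } j \ne 1. \]

By the elementary rank inequality, it follows that

\[ \mathrm{rank}(A) \le \mathrm{rank}(\hat\theta^t\hat\theta) + \mathrm{rank}(D) = 2. \]

It implies that

\[ \mathrm{rank}(A) < m \quad\text{if } m > 2. \]

Since $A$ is not invertible for $m > 2$, the Lagrange method will not work
in this case. The g-inverse of the matrix $A$ could be used instead.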

3. Asymptotic properties of the weights. In this section we present the
asymptotic properties of the cross-validated weights for the general case.
Let $\hat\theta_1$ be the MLE based on the first sample of size $n_1$. Let
$\hat\theta_1^{(-j)}$ and $\tilde\theta_1^{(-j)}$ be the MLE and WLE,
respectively, based on $m$ samples without the $j$th data point from the
first sample. This generalizes the two cases where either only the $j$th data
point is deleted from the first sample or the $j$th data point from each
sample is deleted. Note that $\tilde\theta_1^{(-j)}$ is a function of the
weight vector $\lambda$. Let $\frac{1}{n_1}D_{n_1}$ be the average
discrepancy in the cross-validation given by

\[ \frac{1}{n_1}D_{n_1}(\lambda) = \frac{1}{n_1}\sum_{j=1}^{n_1} \bigl( X_{1j} - \phi(\tilde\theta_1^{(-j)}) \bigr)^2. \]

Let $\lambda^{(cv)}$ be the optimum weights chosen by cross-validation. We
require that $\sum_{i=1}^{m} \lambda_i = 1$. Let
$\theta^0 = (\theta_1^0, \theta_2^0, \ldots, \theta_m^0)$, where $\theta_1^0$
is the true value of $\theta_1$. We then have the following theorem.

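Theorem 3.1. Assume that:

(i) $\frac{1}{n_1}D_{n_1}$ has a unique minimum for any fixed $n_1$;

(ii) $\frac{1}{n_1}\sum_{j=1}^{n_1} \bigl( \phi(\hat\theta_1^{(-j)}) - \phi(\hat\theta_1) \bigr) \xrightarrow{P_{\theta^0}} 0$ as $n_1 \to \infty$;

(iii) $P_{\theta^0}\bigl( \frac{1}{n_1}\sum_{j=1}^{n_1} ( X_{1j} - \phi(\hat\theta_1^{(-j)}) )^2 < K \bigr) \to 1$ for some constant $0 < K < \infty$;

(iv) $P_{\theta^0}\bigl( |\phi(\tilde\theta_1^{(-j)}) - \phi(\hat\theta_1^{(-j)})| > M \bigr) = o(1/n_1)$ for some constant $0 < M < \infty$.

Then

\[ \lambda^{(cv)} \xrightarrow{P_{\theta^0}} w_0 = (1, 0, 0, \ldots, 0)^t. \tag{9} \]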

The assumptions of the above theorem are satisfied by the linear-WLE
case presented in Section 2. We state that fact in the following corollary
whose proof is straightforward and omitted for brevity.

Corollary 3.1. Suppose $X_{i1}, X_{i2}, \ldots, X_{in}$ are independent with
density function $f(x; \theta_i)$, $i = 1, 2$. If the WLE has linear form and
$\mu_1 \ne \mu_2$, then

\[ \lambda^{(cv)} \xrightarrow{P_{\theta^0}} w_0 = (1, 0)^t. \tag{10} \]

Furthermore, Theorem 3.1 also applies to cases in which the WLE does
not have the linear form. One such important case involves the log-normal
distribution, which is widely used in practice. Suppose
$X_{ij} \stackrel{\mathrm{i.i.d.}}{\sim} \mathrm{LN}(\mu_i, 1)$,
$j = 1, \ldots, n$, $i = 1, 2$, where $\mu_i$ and 1 denote, respectively, the
mean and standard deviation of $\log X_{ij}$ for all $i$ and $j$. It can be
verified that, for $i = 1, 2$,

\[ E_{\mu_i}(X_{ij}) = \phi(\mu_i) = e^{\mu_i + 1/2}, \qquad j = 1, 2, \ldots, n. \]

It also follows that the MLE and the WLE are given by

\[ \mathrm{MLE}(\mu_1) = \hat\mu_1 = \frac{1}{n}\sum_{j=1}^{n} \log(x_{1j}), \tag{11} \]

\[ \mathrm{WLE}(\mu_1) = \tilde\mu_1 = \frac{\lambda_1}{n}\sum_{j=1}^{n} \log(x_{1j}) + \frac{\lambda_2}{n}\sum_{j=1}^{n} \log(x_{2j}), \tag{12} \]

where $\lambda_1 + \lambda_2 = 1$.
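Therefore,

\[ \phi(\hat\mu_1^{(-j)}) = \exp\Bigl\{ \frac{1}{n-1}\sum_{k \ne j} \log(X_{1k}) + 1/2 \Bigr\}, \tag{13} \]

\[ \phi(\tilde\mu_1^{(-j)}) = \exp\Bigl\{ \frac{\lambda_1}{n-1}\sum_{k \ne j} \log(X_{1k}) + \frac{\lambda_2}{n-1}\sum_{k \ne j} \log(X_{2k}) + 1/2 \Bigr\}, \tag{14} \]

for $j = 1, 2, \ldots, n$.

Therefore the average discrepancy of cross-validation for the log-normal
case is given by

\[ \frac{1}{n}D_n(\lambda) = \frac{1}{n}\sum_{j=1}^{n} \Bigl( X_{1j} - \exp\Bigl\{ \frac{\lambda_1}{n-1}\sum_{k \ne j} \log(X_{1k}) + \frac{\lambda_2}{n-1}\sum_{k \ne j} \log(X_{2k}) + 1/2 \Bigr\} \Bigr)^2. \tag{15} \]

Since we require that $\lambda_1 + \lambda_2 = 1$, we can rewrite the average
discrepancy as

\[ \frac{1}{n}D_n(1 - \lambda_2, \lambda_2) = \frac{1}{n}\sum_{j=1}^{n} \Bigl( e^{Y_{1j}} - e^{\bar Y_{1\cdot}^{(-j)} + \lambda_2 (\bar Y_{2\cdot}^{(-j)} - \bar Y_{1\cdot}^{(-j)}) + 1/2} \Bigr)^2, \tag{16} \]

where

\[ \bar Y_{i\cdot}^{(-j)} = \frac{1}{n-1}\sum_{k \ne j} Y_{ik} \quad\text{and}\quad Y_{ij} = \log(X_{ij}), \qquad i = 1, 2,\ j = 1, 2, \ldots, n. \]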
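
Because the log-normal WLE is not linear in the weights, the minimizer of
the average discrepancy has no simple closed form. The sketch below
evaluates (16) and picks the best $\lambda_2$ on a grid. The grid search,
its range $[-1, 1]$ and the function names are our own illustrative choices,
not part of the article; any one-dimensional optimizer could be substituted.

```python
import math

def avg_discrepancy(x1, x2, lam2):
    """Average leave-one-column-out discrepancy for LN(mu_i, 1) samples."""
    n = len(x1)
    y1 = [math.log(x) for x in x1]
    y2 = [math.log(x) for x in x2]
    s1, s2 = sum(y1), sum(y2)
    total = 0.0
    for j in range(n):
        yb1 = (s1 - y1[j]) / (n - 1)   # mean of log-sample 1 without column j
        yb2 = (s2 - y2[j]) / (n - 1)   # mean of log-sample 2 without column j
        pred = math.exp(yb1 + lam2 * (yb2 - yb1) + 0.5)
        total += (x1[j] - pred) ** 2
    return total / n

def grid_minimizer(x1, x2, lo=-1.0, hi=1.0, steps=2001):
    """Grid value of lambda_2 with the smallest average discrepancy."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return min(grid, key=lambda lam2: avg_discrepancy(x1, x2, lam2))

x1 = [1.0, math.e, math.e ** 2]   # toy data with logs 0, 1, 2
x2 = [2.0, 1.0, 3.0]              # toy second sample
lam2_star = grid_minimizer(x1, x2)
```

Setting `lam2 = 0` recovers the discrepancy of the leave-one-out MLE
predictor in (13), which gives a convenient baseline for comparison.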

We then have the following lemma and corollary. 

Lemma 3.1. Assume that $X_{i1}, X_{i2}, \ldots, X_{in}$ are independent
random variables and follow the log-normal distribution with parameters
$(\mu_i, 1)$, $i = 1, 2$. Let $\lambda_2^{*}(n)$ be the optimum weight that
minimizes $\frac{1}{n}D_n(1 - \lambda_2, \lambda_2)$ for any fixed $n$. If
$\mu_1 \ne \mu_2$, then (i) $\frac{1}{n}D_n(1 - \lambda_2, \lambda_2)$ is
strictly convex; (ii) $\lim_{n \to \infty} \lambda_2^{*}(n)$ exists and
$|\lim_{n \to \infty} \lambda_2^{*}(n)| < 1$ with probability 1.

Corollary 3.2. Under the assumptions of Lemma 3.1, if $\mu_1 \ne \mu_2$, then

\[ \lambda^{(cv)} \xrightarrow{P_{\theta^0}} w_0 = (1, 0)^t. \tag{17} \]

Wang, van Eeden and Zidek (2004) establish the asymptotic normality of
the WLE for fixed weights. Under certain regularity conditions and by
Theorem 3.1, we then have the following asymptotic results for using adaptive
weights.

Theorem 3.2. For each $\theta_1^0$, the true value of $\theta_1$, and each
$\theta_1 \ne \theta_1^0$,

\[ \lim_{n_1 \to \infty} P_{\theta^0}\Bigl( \prod_{i=1}^{m}\prod_{j=1}^{n_i} f_1(X_{ij}; \theta_1^0)^{\lambda_i^{(cv)}(X)} > \prod_{i=1}^{m}\prod_{j=1}^{n_i} f_1(X_{ij}; \theta_1)^{\lambda_i^{(cv)}(X)} \Bigr) = 1, \]

for any $\theta_2, \theta_3, \ldots, \theta_m$, $\theta_i \in \Theta$,
$i = 2, 3, \ldots, m$.

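Theorem 3.3. For any sequence of maximum weighted likelihood estimates
$\tilde\theta_1^{(n_1)}$ of $\theta_1$ constructed with adaptive weights
$\lambda_i^{(cv)}(X)$, and for all $\varepsilon > 0$,

\[ \lim_{n_1 \to \infty} P_{\theta^0}\bigl( \|\tilde\theta_1^{(n_1)} - \theta_1^0\| > \varepsilon \bigr) = 0, \]

for any $\theta_2, \theta_3, \ldots, \theta_m$, $\theta_i \in \Theta$,
$i = 2, 3, \ldots, m$.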

We assume that the parameter space is an open subset of $R^p$. The
asymptotic normality of the WLE constructed by cross-validated weights
follows.

Theorem 3.4 (Multidimensional). Suppose:

(i) for almost all $x$ the first and second partial derivatives of
$f_1(x; \theta)$ with respect to $\theta$ exist, are continuous in
$\theta \in \Theta$, and may be passed through the integral sign in
$\int f_1(x; \theta)\, d\nu(x) = 1$;

(ii) there exist three functions $G_1(x)$, $G_2(x)$ and $G_3(x)$ such that
for all $\theta_2, \ldots, \theta_m$,
$E_{\theta^0}|G_l(X_{ij})|^2 \le K_l < \infty$, $l = 1, 2, 3$,
$i = 1, \ldots, m$, and in some neighborhood of $\theta_1^0$ each component
of $\psi(x) = \frac{\partial}{\partial\theta}\log f_1(x; \theta)$ [resp.
$\dot\psi(x)$] is bounded in absolute value by $G_1(x)$ [resp. $G_2(x)$]
uniformly in $\theta_1 \in \Theta$. Further,

\[ \frac{\partial^3 \log f_1(x; \theta_1)}{\partial\theta_{1k_1}\,\partial\theta_{1k_2}\,\partial\theta_{1k_3}}, \qquad k_1, k_2, k_3 = 1, \ldots, p, \]

are bounded by $G_3(x)$ uniformly in $\theta_1 \in \Theta$;

(iii) $I(\theta_1^0)$ is positive definite.

Then there exists a sequence of roots of the weighted likelihood function
based on adaptive weights $\tilde\theta_1^{(n_1)}$ that is weakly consistent
and

\[ \sqrt{n_1}\,\bigl(\tilde\theta_1^{(n_1)} - \theta_1^0\bigr) \xrightarrow{D} N\bigl(0, I(\theta_1^0)^{-1}\bigr) \quad\text{as } n_1 \to \infty. \]

4. Simulation studies. To demonstrate and verify the benefits of using
cross-validation procedures described in previous sections, we perform
simulations according to the following algorithm that deletes the $j$th point
from each sample, that is, a delete-one-column approach. Let $\mu_1^0$ and
$\mu_2^0$ denote the true values of the parameters. Let
$C = \mu_2^0 - \mu_1^0$, which is the difference between the two population
means.

Table 1

MSE × 100 of the MLE and the WLE and their standard deviations × 100 for samples of
equal sizes generated from N(0, 1) and N(0.3, 1). A correction term is employed in the
calculations of the optimum weights to handle numerical instability

 n   MSE(MLE)   SD of $(\mathrm{MLE} - \theta_1^0)^2$   MSE(WLE)   SD of $(\mathrm{WLE} - \theta_1^0)^2$   MSE(WLE)/MSE(MLE) (%)

10      10                    15                            8                     12                               80
20       4                     6                            4                      5                               85
30       3                     4                            3                      4                               87
40       3                     4                            2                      3                               91
50       2                     3                            2                      2                               92
60       2                     2                            2                      2                               94


Step 1. Draw random samples of size $n$ from $f_1(x; \mu_1^0)$ and $f_2(x; \mu_2^0)$.
Step 2. Calculate the cross-validated optimum weights by using (3).
Step 3. Calculate $(\mathrm{MLE} - \mu_1^0)^2$ and $(\mathrm{WLE} - \mu_1^0)^2$.

Repeat Steps 1-3 1000 times. Calculate the averages and standard deviations
of the squared estimation errors for both the MLE and WLE.
Calculate the averages and standard deviations of the optimum weights.
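
The steps above can be sketched as follows for two normal populations. This
is an illustration under assumptions that the article does not spell out:
the two samples are drawn independently, the weights come from the
delete-one-column ratio of Section 2.1.1, and the unspecified "correction
term" used for the published tables is omitted, so the numbers it produces
need not match Tables 1 and 2. The names `cv_lambda2` and `simulate` are
hypothetical.

```python
import random

def cv_lambda2(x1, x2):
    """Delete-one-column cross-validated weight for the second sample."""
    n = len(x1)
    s1, s2 = sum(x1), sum(x2)
    num = den = 0.0
    for j in range(n):
        m1 = (s1 - x1[j]) / (n - 1)
        m2 = (s2 - x2[j]) / (n - 1)
        num += (m1 - m2) * (m1 - x1[j])
        den += (m1 - m2) ** 2
    return num / den

def simulate(n=10, mu1=0.0, mu2=0.3, reps=2000, seed=1):
    """Monte Carlo estimate of the MSE of the MLE and of the linear WLE."""
    rng = random.Random(seed)
    se_mle = se_wle = lam1_sum = 0.0
    for _ in range(reps):
        # Step 1: draw the two samples (independently, by assumption)
        x1 = [rng.gauss(mu1, 1.0) for _ in range(n)]
        x2 = [rng.gauss(mu2, 1.0) for _ in range(n)]
        # Step 2: cross-validated weights
        lam2 = cv_lambda2(x1, x2)
        # Step 3: squared errors of the two estimators of mu1
        mle = sum(x1) / n
        wle = (1.0 - lam2) * mle + lam2 * sum(x2) / n
        se_mle += (mle - mu1) ** 2
        se_wle += (wle - mu1) ** 2
        lam1_sum += 1.0 - lam2
    return {"mse_mle": se_mle / reps,
            "mse_wle": se_wle / reps,
            "avg_lambda1": lam1_sum / reps}

result = simulate()
```

Changing `mu2` varies the difference $C$ between the population means, which
is the quantity the text reports as driving the size of the MSE gain.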

We generate random samples from $N(\mu_1^0, \sigma_1^2)$ and
$N(\mu_2^0, \sigma_2^2)$ where we set $\sigma_1 = \sigma_2 = 1$ for
simplicity. For the purpose of the demonstration, we set $\mu_1^0 = 0$ and
$\mu_2^0 = 0.3$, which is 30% of the variance. Table 1 shows some results for
the case $\mu_1^0 = 0$ and $\mu_2^0 = 0.3$. Setting $\mu_1^0 = 0$, we tried
other values for $C$. In general, the larger the value of $C$, the less
improvement in the MSE. For example, if we set $\sigma_1 = \sigma_2 = 1$ and
$C = \mu_2^0 - \mu_1^0 = 1$, the ratio of the MSE for MLE and WLE will be
almost 1. This implies that the cross-validation procedure will not make much
use of the second sample in this situation.

Table 1 shows that the MSE of the WLE is smaller than that of the MLE for
small and moderate sample sizes: the ratio MSE(WLE)/MSE(MLE) rises from 80%
at $n = 10$ to 94% at $n = 60$. The standard deviations of the squared
differences for the WLE are less than or equal to those of the MLE. This
suggests that the WLE not only achieves a smaller MSE but also has a less
variable squared error than the MLE. Intuitively, as the sample size
increases, the importance of the second sample diminishes. As indicated by
Table 2, the cross-validation procedure realizes this and then assigns a
larger value to $\lambda_1$ as the first sample size increases. The optimum
weights do increase towards the asymptotic weights $(1, 0)$ for the normal
case, albeit quite slowly.

We repeat the procedure for Poisson distributions with $\mathcal{P}(3)$ and
$\mathcal{P}(3.6)$. Some of the results are shown in Tables 3 and 4. The
results for the Poisson distributions differ from the normal case. The most
striking difference is in the ratio of the WLE's average MSE versus that of
the MLE. The WLE achieves a smaller average MSE when the sample sizes are
less than 30. These results contrast with the normal case, where the critical
sample size is 45.

We remark that the reduction in MSE will disappear if we set
$C = \mu_2^0 - \mu_1^0 = 1.5$ in the above case. Thus, the cross-validation
procedure will not combine the two samples if the second sample does not help
to predict the behavior of the first. We should emphasize that the value $C$
in both cases is not used in the cross-validation procedure itself.

We remark that simulations using the delete-one-point approach have also
been done. They give quite similar results.

Table 2

Average optimum weights × 100 and their standard deviations × 100 for samples
of equal sizes generated from N(0, 1) and N(0.3, 1). A correction term is
employed in the calculations of the optimum weights to handle numerical
instability

 n   AVE. of $\lambda_1$   AVE. of $\lambda_2$   SD of $\lambda_1$ and $\lambda_2$

10           79                    21                         6
20           85                    15                         4
30           88                    11                         3
40           90                    10                         3
50           91                     9                         2
60           92                     8                         2

Table 3

MSE × 100 of the MLE and the WLE and their standard deviations × 100 for samples of
equal sizes generated from $\mathcal{P}(3)$ and $\mathcal{P}(3.6)$. A correction term is employed in the
calculations of the optimum weights to handle numerical instability

 n   MSE(MLE)   SD of $(\mathrm{MLE} - \theta_1^0)^2$   MSE(WLE)   SD of $(\mathrm{WLE} - \theta_1^0)^2$   MSE(WLE)/MSE(MLE) (%)

10      31                    45                           27                     40                               86
20      15                    22                           14                     19                               90
30      10                    14                            9                     13                               94
40       8                    11                            8                     10                               96
50       6                     8                            5                      8                               97
60       5                     8                            5                      7                               97

5. Application to disease mapping. In this section we address the problem
of analyzing disease mapping data. In particular, we demonstrate a
weighted likelihood alternative to the hierarchical Bayes approach that has
been used in references cited in the discussion section. Our approach allows
the data themselves to select the weights through cross-validation. We
thereby avoid the need for a prior to model the latent patterns of
environmental hazards that may lead to the adverse health effects being
mapped. Such hazards include air pollution that has been associated with
respiratory morbidity [see, e.g., Burnett and Krewski (1994)
and Zidek, White and Le (1998)].

Our demonstration involves parallel time series of weekly hospital admis- 
sions for respiratory disease in residents of 733 census subdivisions (CSD) in 
southern Ontario. The data are collected from the May-to-August periods 
from 1983 to 1988. In this demonstration we confine attention to certain 
densely populated areas. 

Let us consider the problem of estimating the rate of weekly hospital 
admissions of CSD 380, the one with the largest total annual hospital ad- 
missions among all CSDs from 1983 to 1988. This proves to be a challenging 
task due to the sparseness of the data set. The original data set contains 
many 0's, representing no hospital admissions. For example, although CSD 
380 has the largest total number of hospital admissions among all the CSDs, 
no patient was admitted during 112 out of the 123 days in the summer of 
1983. On some days, however, quite a number of people sought treatment 
for acute respiratory disease possibly due to high levels of air pollution in 
their regions. Again referring to CSD 380, 17 patients were admitted on day 
51 alone in 1983. 

A more graphical description of these irregularities in admission counts for
this CSD is seen in Figure 1. There daily counts are shown and the problems
of data sparseness and high level of variations are extreme. In fact, in this
demonstration we have chosen to avoid the complexities of modeling these
daily data series and we turn instead to weekly counts. While those problems
remain, they are not nearly so acute. In total, each of the summers in the
years covered by our study has 17 weeks. For simplicity, the data obtained
in the last few days of each summer are dropped from the analysis since
they do not constitute a whole week.

Table 4

Average optimum weights × 100 and their standard deviations × 100 for samples
of equal sizes generated from $\mathcal{P}(3)$ and $\mathcal{P}(3.6)$. A
correction term is employed in the calculations of the optimum weights to
handle numerical instability

 n   AVE. of $\lambda_1$   AVE. of $\lambda_2$   SD of $\lambda_1$ and $\lambda_2$

10           80                    20                         7
20           86                    14                         5
30           88                    12                         4
40           90                    10                         3
50           92                     8                         3
60           92                     8                         2

Fig. 1. Daily hospital admissions for CSD 380 in the summer of 1983.

5.1. Weighted likelihood estimation. We assume the weekly hospital ad- 
missions for any given CSD follow Poisson distributions, that is, for year q, 
CSD i and week j, 

\[ Y_{ij}^{q} \sim \mathcal{P}(\theta_{ij}^{q}), \qquad j = 1, 2, \ldots, 17;\ i = 1, 2, \ldots, 733;\ q = 1, 2, \ldots, 6. \]

The raw estimates of $\theta_{ij}^{q}$, namely $Y_{ij}^{q}$, are highly
unreliable since the effective sample size in this case is 1. Moreover, each CSD may contain only a
small group of people who suffer respiratory diseases. These considerations 
point to the need to "borrow strength," a standard tool of disease mapping 
techniques. That is, the information in neighboring CSDs can be combined 
to produce more reliable estimates while introducing only a small amount 
of bias. 

For any given CSD, the "neighboring" CSDs are defined to be CSDs in 
close proximity to CSD 380. To estimate the rate of weekly hospital admis- 
sions in a particular CSD, we would expect that neighboring subdivisions 
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Fig. 2. Hospital admissions for CSDs 380, 362, 366 and 367 in 1983. 

contain relevant information which might help us to derive a better esti- 
mate than the traditional sample average. Thus, the Euclidean distances 
between the target CSD and other CSDs in the data set are calculated by 
using the longitudes and latitudes. We apply a somewhat arbitrary thresh- 
old, 0.2, to the Euclidean distances in order to define neighbors. For CSD 
380, neighboring CSDs turn out to be CSDs 362, 366 and 367. 
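
The thresholding step just described can be sketched in a few lines of Python. This is only an illustration: the coordinates below are hypothetical (they are not the real CSD locations), the distance is taken in raw longitude/latitude units, and the strict "less than 0.2" comparison is an assumption, since the text does not say whether the threshold is inclusive.

```python
import math


def find_neighbors(target, coords, threshold=0.2):
    """Return the IDs whose Euclidean distance to `target` is below `threshold`.

    `coords` maps an ID to a (longitude, latitude) pair; distances are computed
    directly on those coordinates, as in the text.
    """
    lon0, lat0 = coords[target]
    neighbors = []
    for csd, (lon, lat) in coords.items():
        if csd == target:
            continue
        if math.hypot(lon - lon0, lat - lat0) < threshold:
            neighbors.append(csd)
    return sorted(neighbors)


# Hypothetical coordinates, invented for illustration only.
coords = {
    380: (-79.40, 43.70),
    362: (-79.30, 43.75),
    366: (-79.45, 43.60),
    367: (-79.52, 43.78),
    401: (-78.90, 44.20),  # a far-away subdivision that should be excluded
}

neighbors = find_neighbors(380, coords)
print(neighbors)
```

With these made-up coordinates the three nearby subdivisions are retained and the distant one is dropped; with real data the choice of threshold would need the same care the text gives it.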

The time series plots of weekly hospital admissions for those selected CSDs 
in 1983 are shown in Figure 2. Hospital admissions of these CSDs indeed 
seem to be related since the major peaks in the time series plot occurred at 
roughly the same time points. However, as noted earlier, the data from other 
CSDs may introduce bias. Thus the WLE's weights are needed to control 
the degree of bias. 

To find cross-validatory choices for these weights, we consider purely as a
working assumption that $\theta^q_{1j} = \theta^q_1$ for $j = 1,2,\ldots,17$. In fact, that assumption
does not seem tenable since every year week 8 has markedly larger numbers
of hospital admissions for CSD 380 than the remaining weeks. For example,
in 1983, there are 21 admissions in week 8 while the second largest weekly
count is only 7 in week 15. Thus, we are forced to drop week 8 from our
working assumption and instead assume $\theta^q_{1j} = \theta^q_1$ for $j = 1,2,\ldots,7,9,\ldots,17$.
In fact, the sample means and variances of the weekly hospital admissions
for those 16 weeks of CSD 380 are quite close to each other, in support of 
our assumption. One alternative to assuming the constancy of weights over 
the whole summer would be the use of a moving window just a few weeks 
in width. We leave that option for future work. 

For Poisson distributions the MLE of $\theta^q_1$ is the sample average of the
weekly admissions of CSD 380, while the WLE is a linear combination of
the sample averages for each CSD. Thus, the weighted likelihood estimate of
the population mean of weekly hospital admissions for a CSD is

\[ \mathrm{WLE}^q = \sum_{i=1}^{m} \lambda^q_i \bar{Y}^q_{i\cdot}, \qquad q = 1,2,\ldots,6, \]

where $\bar{Y}^q_{i\cdot}$ is the overall sample average of CSD $i$ for year $q$.
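
As a small illustration of the linear form of the WLE, the sketch below combines per-CSD sample averages with a weight vector. The weights are the 1984 values reported in Table 6 (divided by 100); the sample averages are hypothetical numbers chosen only to make the arithmetic visible, and the function name is ours.

```python
def weighted_likelihood_estimate(weights, sample_means):
    """Linear WLE: a weighted combination of the per-CSD sample averages."""
    if len(weights) != len(sample_means):
        raise ValueError("weights and sample_means must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * m for w, m in zip(weights, sample_means))


# Weights for CSDs 380, 362, 366 and 367 in 1984, read from Table 6 (x 100 there).
weights_1984 = [0.46, 0.20, 0.12, 0.22]
# Hypothetical weekly sample averages, invented for illustration only.
sample_means = [2.0, 3.0, 1.5, 2.5]

wle = weighted_likelihood_estimate(weights_1984, sample_means)
mle = sample_means[0]  # the MLE uses the target CSD alone
print(round(wle, 4), mle)
```

The WLE is pulled away from the target CSD's own average toward its neighbors' averages, which is exactly the bias that the weights are meant to keep under control.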

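In our analysis the weights are selected by the cross-validation procedure
proposed in Section 2. Recall that the cross-validated weights for equal
sample sizes are

\[ \lambda^{q,\mathrm{opt}} = A_q^{-1}\left( b_q + \frac{1 - \mathbf{1}^t A_q^{-1} b_q}{\mathbf{1}^t A_q^{-1}\mathbf{1}}\,\mathbf{1} \right), \]

where $b_q(y)_i = \sum_j Y^q_{1j}\,\bar{Y}^{q(-j)}_{i\cdot}$ and $A_q(y)_{ik} = \sum_j \bar{Y}^{q(-j)}_{i\cdot}\,\bar{Y}^{q(-j)}_{k\cdot}$, $i = 1,2,3,4$;
$k = 1,2,3,4$.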
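
The computation behind these weights is a least-squares problem with a sum-to-one constraint, and it can be sketched without any external library. The code below is a minimal illustration under our own assumptions: the weekly counts are hypothetical, the helper names (`solve`, `loo_means`, `cv_weights`) are ours, and the Lagrange-multiplier solution is the standard one for minimizing the leave-one-out squared prediction error subject to the weights summing to 1.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is numerically singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def loo_means(series):
    """Leave-one-out sample means: entry j is the mean without observation j."""
    n, total = len(series), sum(series)
    return [(total - y) / (n - 1) for y in series]


def cv_objective(weights, data):
    """Sum over j of (Y_1j - sum_i w_i * leave-one-out mean of population i)^2."""
    loo = [loo_means(s) for s in data]
    return sum(
        (data[0][j] - sum(w * loo[i][j] for i, w in enumerate(weights))) ** 2
        for j in range(len(data[0]))
    )


def cv_weights(data):
    """Minimize cv_objective subject to sum(weights) == 1 (Lagrange multiplier)."""
    m, n = len(data), len(data[0])
    loo = [loo_means(s) for s in data]
    b = [sum(data[0][j] * loo[i][j] for j in range(n)) for i in range(m)]
    a = [[sum(loo[i][j] * loo[k][j] for j in range(n)) for k in range(m)]
         for i in range(m)]
    x = solve(a, b)            # A^{-1} b
    z = solve(a, [1.0] * m)    # A^{-1} 1
    shift = (1.0 - sum(x)) / sum(z)
    return [xi + shift * zi for xi, zi in zip(x, z)]


# Hypothetical weekly counts (row 0 is the target population); illustration only.
data = [
    [2, 0, 3, 1, 4, 2, 1, 3],
    [1, 2, 2, 0, 3, 1, 2, 2],
    [3, 1, 4, 2, 5, 2, 2, 3],
    [0, 1, 1, 2, 2, 0, 1, 1],
]
weights = cv_weights(data)
print([round(w, 3) for w in weights])
```

Note that nothing in this sketch forces the weights to be nonnegative; the sum-to-one constraint is the only one imposed, which matches the closed-form solution recalled above.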

5.2. Results of the analysis. We assess the performance of the MLE and 
the WLE by comparing their MSEs. The MSEs of the MLE and the WLE 
are defined by, for q = 1, 2, . . . , 6, 

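\[ \mathrm{MSE}^q_M(\theta^q_1) = E_{\theta^q_1}(\bar{Y}^q_{1\cdot} - \theta^q_1)^2, \qquad
   \mathrm{MSE}^q_W(\theta^q_1) = E_{\theta^q_1}(\mathrm{WLE}^q - \theta^q_1)^2. \]

In fact, the $\theta^q_1$'s are unknown. We then estimate $\mathrm{MSE}^q_M$ and $\mathrm{MSE}^q_W$ by
replacing $\theta^q_1$ by the MLE. Under the assumption of Poisson distributions,
the estimated MSE for the MLE is given by

\[ \widehat{\mathrm{MSE}}^q_M = \widehat{\mathrm{var}}(Y^q_1)/16, \qquad q = 1,2,\ldots,6. \]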
The estimated MSE for the WLE is given as follows: 

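\[
\begin{aligned}
\widehat{\mathrm{MSE}}^q_W
 &= E\left( \sum_{i=1}^{m} \lambda^q_i \bar{Y}^q_{i\cdot} - \theta^q_1 \right)^2 \\
 &= \mathrm{Var}\left( \sum_{i=1}^{m} \lambda^q_i \bar{Y}^q_{i\cdot} \right)
   + \left( E\sum_{i=1}^{m} \lambda^q_i \bar{Y}^q_{i\cdot} - \theta^q_1 \right)^2 \\
 &\approx \sum_{i=1}^{m}\sum_{k=1}^{m} \lambda^q_i \lambda^q_k\, \widehat{\mathrm{cov}}(\bar{Y}^q_{i\cdot}, \bar{Y}^q_{k\cdot})
   + \left( \sum_{i=1}^{m} \lambda^q_i \bar{Y}^q_{i\cdot} - \bar{Y}^q_{1\cdot} \right)^2 .
\end{aligned}
\]

Table 5

Estimated MSEs for the MLE and the WLE. All entries have been
multiplied by 100. The MSEs have also been multiplied by 16 since there
are 16 weeks

Year    MLE    WLE    16 × MSE_M    16 × MSE_W    MSE_W/MSE_M
  1     19     17         10            8.4            80
  2     33     28         24            13             87
  3     2.3    26         29            14             54
  4     15     22         16            8.4            96
  5     30     32         30            13             80
  6     38     41         41            24             54
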

The estimated MSEs for the MLE and the WLE are given in Table 5. It can 
be seen that the MSE for the WLE is much smaller than that of the MLE. 
In fact, the average reduction of the MSE by using WLE is about 25%. 
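
The comparison above can be reproduced along the following lines. This is a sketch under stated assumptions: the estimated MSE of the WLE is taken to have the usual "variance plus squared bias" form, with the covariance of two sample averages estimated by the sample covariance (divisor n - 1) divided by the number of weeks; the weekly counts and weights are hypothetical; only the ratio column is taken from Table 5.

```python
def mean(xs):
    return sum(xs) / len(xs)


def sample_cov(xs, ys):
    """Unbiased sample covariance (divisor n - 1); an assumption of this sketch."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def estimated_mse_mle(data):
    """Estimated MSE of the target sample average: var(Y_1) / n."""
    target = data[0]
    return sample_cov(target, target) / len(target)


def estimated_mse_wle(weights, data):
    """Estimated variance of the weighted average plus its squared estimated bias."""
    n = len(data[0])
    means = [mean(s) for s in data]
    variance = sum(
        wi * wk * sample_cov(si, sk) / n
        for wi, si in zip(weights, data)
        for wk, sk in zip(weights, data)
    )
    bias = sum(w * m for w, m in zip(weights, means)) - means[0]
    return variance + bias ** 2


# Hypothetical weekly counts (row 0 is the target CSD) and hypothetical weights.
data = [
    [2, 0, 3, 1, 4, 2, 1, 3],
    [1, 2, 2, 0, 3, 1, 2, 2],
    [3, 1, 4, 2, 5, 2, 2, 3],
]
weights = [0.6, 0.25, 0.15]
print(estimated_mse_mle(data), estimated_mse_wle(weights, data))

# Ratio column of Table 5 (MSE_W / MSE_M, in percent) and its average.
ratios_percent = [80, 87, 54, 96, 80, 54]
average_ratio = sum(ratios_percent) / len(ratios_percent)
print(round(average_ratio, 1), round(100 - average_ratio, 1))
```

Averaging the six tabulated ratios gives roughly 75%, which is where the "about 25%" average reduction comes from. As a sanity check on the sketch itself, putting all weight on the target CSD makes the WLE formula collapse to the MLE formula.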

Combining information across these CSDs might also help us in predic- 
tions since the patterns exhibited in one neighboring location in a particular 
year might manifest themselves at the location of interest the next year. To 
assess the performance of the WLE, we also use the WLE derived from one 
particular year to predict the overall weekly average of the next year. The 
overall prediction error is defined as the average of those prediction errors. 
To be more specific, the overall prediction errors for the WLE and the MLE 
are defined as follows: 



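\[ \mathrm{PRED}_M = \sqrt{\frac{1}{5}\sum_{q=1}^{5} \left( \bar{Y}^{q}_{1\cdot} - \bar{Y}^{q+1}_{1\cdot} \right)^2 }, \qquad
   \mathrm{PRED}_W = \sqrt{\frac{1}{5}\sum_{q=1}^{5} \left( \mathrm{WLE}^{q} - \bar{Y}^{q+1}_{1\cdot} \right)^2 }. \]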



The average prediction error for the MLE, $\mathrm{PRED}_M$, is 0.065, while $\mathrm{PRED}_W$,
the average prediction error for the WLE, is 0.047, which is about 72% of
that of the MLE.

From Table 6, we see that there is a strong linear association between CSD
380 and CSD 366 (correlation 0.91). However, the weight assigned to CSD 366,
0.12, is the smallest one. This suggests that CSDs with higher correlations
contain less information for the prediction, since their patterns might be too
similar to those of the target CSD in a given year to be helpful in the
prediction for the next year. Thus CSD 366, which has the smallest weight,
should not be included in the analysis. Therefore, the "neighborhood" of CSD
380 in the analysis should only include CSD 362 and CSD 367.

In general, we might examine those CSDs that are in close proximity 
to the target CSD. We can calculate the weight for each selected CSD by 
using the cross-validation procedure. CSDs with small weights should be 
dropped from the analysis since they are not deemed to be helpful. 

The predictive distributions for the weekly totals will be Poisson as well.
We can then derive the 95% predictive intervals for the weekly average
hospital admissions. This might be criticized as failing to take into account
the uncertainty of the unknown parameter. Smith (1999) argues that the
traditional plug-in method has a smaller MSE than the posterior mean
under certain circumstances, in particular when the
true value of the parameter is not large. Let $\mathrm{CI}_W$ and $\mathrm{CI}_M$ be the 95%
predictive intervals of the weekly averages calculated from the WLE and
the MLE, respectively. The results are shown in Table 7.
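
A plug-in predictive interval of this kind can be sketched as follows. The code treats the point estimate (MLE or WLE) as the known mean of a Poisson count and reads off equal-tailed quantiles; both the equal-tailed construction and the example means are assumptions of this sketch, not details taken from the analysis.

```python
import math


def poisson_quantile(mean, prob):
    """Smallest integer k with P(Y <= k) >= prob for Y ~ Poisson(mean)."""
    if mean <= 0:
        raise ValueError("mean must be positive")
    k = 0
    pmf = math.exp(-mean)  # P(Y = 0)
    cdf = pmf
    while cdf < prob:
        k += 1
        pmf *= mean / k    # recursion P(Y = k) = P(Y = k - 1) * mean / k
        cdf += pmf
    return k


def plug_in_interval(estimate, level=0.95):
    """Equal-tailed plug-in predictive interval for a Poisson count."""
    alpha = 1.0 - level
    return (poisson_quantile(estimate, alpha / 2),
            poisson_quantile(estimate, 1.0 - alpha / 2))


# Hypothetical point estimates of the weekly mean, for illustration only.
print(plug_in_interval(1.0))
print(plug_in_interval(2.0))
```

Because the interval ignores the sampling variability of the estimate, it is narrower than a fully Bayesian or bootstrap predictive interval would be; that is the criticism acknowledged in the text.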

The weighted likelihood framework discussed in this article requires the
observations obtained from each population to follow the same distribution.
Including the week 8 data would violate that homogeneity assumption and
would therefore have a negative impact on the analysis. Nevertheless, we re-did
the analysis to see that impact. The adaptive weights and the correlation



Table 6

Correlation matrix × 100 and the weights × 100 for 1984

           CSD 380    CSD 362    CSD 366    CSD 367    Weights
CSD 380      100         42         91         57         46
CSD 362       42        100         40         63         20
CSD 366       91         40        100         55         12
CSD 367       57         63         55        100         22




Table 7

Predictive confidence intervals of the MLE and the WLE for CSD 380

Year     CI_M       CI_W
1983    [0, 3]     [0, 3]
1984    [0, 5]     [0, 4]
1985    [0, 4]     [0, 4]
1986    [0, 3]     [0, 4]
1987    [0, 4]     [0, 5]
1988    [0, 5]     [0, 6]

matrix for 1986 are shown in Table 8. We observe that the weight for the
population of interest is almost 0. This is not acceptable since the inference
will ignore the data from the first population. In this case, week 8 for CSD
380 has an observation that is almost 20 times larger than the rest of them.
Since the cross-validation procedure is based on a predictive mechanism,
it is difficult for the procedure to rely on the data points from the first
population for accurate predictions. As a result, it will assign large weights
to the other CSDs, especially those less correlated with the target one or
having a smaller variance. Consequently, the weights will not be able to
control the bias as they are designed to. Instead, they will introduce large bias
into the inference.

Table 9 presents the results obtained when the data from week 8 are 
dropped for 1986. As in Table 6, a large weight, about 50%, is put back 
onto CSD 380, the population of interest. Therefore, data from week 8 must 
be dropped from the analysis in order to control the bias. We discuss some 
alternative methods for detecting unusual weeks in the discussion section. 
In principle, we could fit a separate model for that week. But here it would 
not be feasible because of the rather small sample size. We note that the 
MLE and WLE are both unstable for small sample sizes although the WLE 
will have better performance as shown in the simulation study. 

Table 8

Correlation matrix × 100 and the weights × 100 for 1986 when
week 8 is included in the analysis

           CSD 380    CSD 362    CSD 366    CSD 367    Weights
CSD 380      100         88         74         22        0.1
CSD 362       88        100         76         32         28
CSD 366       74         76        100         44         30
CSD 367       22         32         44        100         42

Table 9

Correlation matrix × 100 and the weights × 100 for 1986 when
week 8 is excluded in the analysis

           CSD 380    CSD 362    CSD 366    CSD 367    Weights
CSD 380      100         23         19        7.6         48
CSD 362       23        100         38         29         18
CSD 366       19         38        100         44         31
CSD 367      7.6         29         44        100        2.6

6. Discussion and future work. The asymptotic results established in
this article are based on the assumption that the sample size of the
population of interest goes to infinity. They do not apply to the situation when the
sample size for the population of interest remains small or moderate while
the sample sizes of other populations go to infinity. If the sample size of the
population of interest is very small, say either 1 or 0, and the number of
populations goes to infinity, then the asymptotic paradigm proposed by Hu
(1997) would be appropriate.

Other choices of weight function have been proposed in the literature.
In the context of local likelihood discussed by Copas (1995), Tibshirani
and Hastie (1987) and Eguchi and Copas (1998), the weight function
is essentially a kernel function with center t and bandwidth h. Hunsberger
(1994) proposes a weight function that assigns zero weight to an observation 
if it is outside a certain neighborhood. Since a kernel-type weight function 
uses Euclidean distance, it might not reflect the underlying spatial structure 
well as we have seen in the disease mapping example. Hu and Rosenberger 
(2000) propose weight functions in analyzing adaptive designs when time 
trends are present. They investigate two classes of weight functions, namely 
the exponential and polynomial types. But the weight function proposed in 
this article does not assume any specific functional form or rely on the choice 
of distance function. The adaptive weights chosen by cross-validation are 
data dependent and determined solely by minimizing the proposed predictive 
discrepancy measure. 

The analysis presented in Section 5 is merely a demonstration of the 
weighted likelihood method. Through exploratory analysis, we find that data 
from week 8 are quite different from the rest of the weeks. Therefore they 
were dropped from the analysis. Given the high dimensionality and actual 
sizes of current data sets in disease mapping, it is not always practical to 
detect those unusual weeks by manual exploratory analysis. One automatic 
approach to detect patterns for the weekly data is to partition those weeks 
into homogeneous subgroups by using some clustering algorithms. Unlike 
the standard clustering in disease mapping that is normally done on the 
spatial grid, the grouping in our case should be done on the temporal scale. 
We applied a standard K-means algorithm with two clusters to the data 
set. The K-means clustering algorithm successfully identified week 8 as the 
only member of one cluster and the rest of the weeks were assigned to 
another cluster. When the number of clusters is unknown, it must be
estimated, which is a very difficult problem in cluster analysis and is
beyond the scope of this article. Fraley and Raftery (1998) discuss the
problem of determining the structure of clustered data
without prior knowledge of the number of clusters. Cheeseman and Stutz
(1996) propose an algorithm, the so-called AutoClass, that can estimate
the number of clusters and then perform the partition. Once the partition
is achieved, the weighted likelihood method can then be applied to those
clusters separately. How to combine the results
from different clusters in a sensible way is left for future work. Furthermore,
the spatial structure is incorporated into the weighted likelihood through the
adaptive weights. However, the current model cannot handle temporal
structures. One natural extension of the proposed weighted likelihood
framework is to handle both spatial and temporal structures.
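
The two-cluster K-means step mentioned above can be sketched on a single series of weekly counts. This is a simplified one-dimensional illustration, not the analysis itself: only the week 8 count (21) and the week 15 count (7) for 1983 come from the text, all other counts are hypothetical, and initializing the two centers at the minimum and maximum is our choice.

```python
def kmeans_two_clusters(values, max_iter=100):
    """Plain Lloyd iterations for two clusters on one-dimensional data."""
    centers = [float(min(values)), float(max(values))]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            for v in values
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for c in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels, centers


# Weekly counts for 17 weeks. Week 8 (21) and week 15 (7) follow the 1983 figures
# quoted in the text; every other value is hypothetical.
weekly_counts = [2, 1, 3, 0, 2, 4, 1, 21, 3, 2, 1, 5, 2, 0, 7, 3, 1]

labels, centers = kmeans_two_clusters(weekly_counts)
unusual_weeks = [week + 1 for week, lab in enumerate(labels) if lab == 1]
print(unusual_weeks, centers)
```

On data of this shape the outlying week ends up alone in its cluster, which mirrors the behavior reported for the real data; with several unusual weeks or a multivariate representation of each week, the choice of the number of clusters becomes the real difficulty.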

Bayes methods including empirical and hierarchical Bayes methods are 
widely used in analyzing disease mapping data. Manton et al. (1989) dis- 
cuss the empirical Bayes procedures for stabilizing maps of cancer mortality 
rates. Ghosh, Natarajan, Waller and Kim (1999) propose a very general hi- 
erarchical Bayes spatial generalized model that is considered broad enough 
to cover a large number of situations where spatial structures need to be 
incorporated. In particular, they propose the following: 

\[ \theta_i = q_i + x_i^t b + u_i + v_i, \qquad i = 1,2,\ldots,m, \]

where the $q_i$ are known constants, $x_i$ are covariates, $u_i$ and $v_i$ are mutually
independent with $v_i \stackrel{\mathrm{i.i.d.}}{\sim} N(0, \sigma_v^2)$ and the $u_i$ have joint probability density
function

\[ f(u) \propto (\sigma_u^2)^{-(1/2)m} \exp\left( -\sum_{i=1}^{m}\sum_{j \ne i} (u_i - u_j)^2 w_{ij} / (2\sigma_u^2) \right), \]

where $w_{ij} \ge 0$ for all $1 \le i \ne j \le m$. The above distribution is designed to
take into account the spatial structure. In that paper, they propose to use 
Wij = 1 if locations i and j are considered neighbors. They also mention 
the possibility of using the inverse of the correlation matrix as the weights 
function. We argue that the weights chosen by the cross-validation proce- 
dure can discover the underlying spatial structure without any parametric 
assumption. Thus those weights might be helpful in selecting an appropriate 
distribution that models the underlying spatial structure. Further analysis 
is needed if one wants to fully compare the performances of the WLE, the 
MLE and the Bayesian estimator in the context of disease mapping. 
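
To make the role of the neighborhood weights in the spatial prior concrete, the sketch below evaluates the logarithm of the unnormalized density of the spatial effects for a given weight matrix. It assumes the density has the pairwise-difference form quoted above, with the double sum running over all ordered pairs with distinct indices; the three-location chain and the effect vectors are hypothetical.

```python
import math


def log_spatial_prior(u, w, sigma2_u):
    """Log of the unnormalized density: -(m/2) log(sigma2_u) - sum (u_i - u_j)^2 w_ij / (2 sigma2_u)."""
    m = len(u)
    quad = sum(
        (u[i] - u[j]) ** 2 * w[i][j]
        for i in range(m)
        for j in range(m)
        if i != j
    )
    return -0.5 * m * math.log(sigma2_u) - quad / (2.0 * sigma2_u)


# Hypothetical chain of three locations: 0 and 1 are neighbors, 1 and 2 are neighbors.
w = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
smooth = log_spatial_prior([0.1, 0.1, 0.1], w, sigma2_u=1.0)
rough = log_spatial_prior([1.0, -1.0, 1.0], w, sigma2_u=1.0)
print(smooth, rough)
```

Spatially smooth effect vectors receive a higher prior density than rough ones, and the pattern of nonzero weights determines which differences are penalized. Data-driven weights such as those from cross-validation could, in principle, inform that pattern.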

APPENDIX 



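Proof of Lemma 2.1. Observe that

\[ \bar{X}^{(-j)}_{i\cdot} = e_n \bar{X}_{i\cdot} - \frac{X_{ij}}{n-1}, \qquad \text{where } e_n = \frac{n}{n-1}. \]

Let $S^e_1 = \frac{1}{n}\sum_{j=1}^{n} (\bar{X}^{(-j)}_{1\cdot} - \bar{X}^{(-j)}_{2\cdot})^2$. It then follows that

\[ S^e_1 = (\bar{X}_{1\cdot} - \bar{X}_{2\cdot})^2 + \frac{1}{(n-1)^2}\left( \hat\sigma_1^2 + \hat\sigma_2^2 - 2\,\widehat{\mathrm{cov}} \right), \]

since $\sum_{j=1}^{n} (\bar{X}_{i\cdot} - X_{ij}) = 0$. Similarly,

\[
\begin{aligned}
S^e_2 &= \frac{1}{n}\sum_{j=1}^{n} (X_{1j} - \bar{X}^{(-j)}_{1\cdot})(\bar{X}^{(-j)}_{2\cdot} - \bar{X}^{(-j)}_{1\cdot}) \\
      &= \frac{n}{(n-1)^2}\left( \frac{1}{n}\sum_{j=1}^{n} X_{1j}^2 - \bar{X}_{1\cdot}^2 - \frac{1}{n}\sum_{j=1}^{n} X_{1j}X_{2j} + \bar{X}_{1\cdot}\bar{X}_{2\cdot} \right) \\
      &= \frac{n}{(n-1)^2}\left( \hat\sigma_1^2 - \widehat{\mathrm{cov}} \right).
\end{aligned}
\]

This completes the proof. □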
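
The leave-one-out identities used in this proof are easy to check numerically. The sketch below assumes that $S^e_1$ and $S^e_2$ are the averaged leave-one-out sums written in the code comments and that the variance and covariance estimates use divisor n; these definitions are inferred from the surrounding argument, and the two short samples are hypothetical.

```python
def loo_mean(xs, j):
    """Sample mean of xs with observation j left out."""
    return (sum(xs) - xs[j]) / (len(xs) - 1)


# Hypothetical paired samples of equal size, for illustration only.
x1 = [2.0, 5.0, 3.0, 4.0, 1.0, 6.0]
x2 = [3.0, 4.0, 6.0, 2.0, 5.0, 7.0]
n = len(x1)

m1, m2 = sum(x1) / n, sum(x2) / n
var1 = sum((a - m1) ** 2 for a in x1) / n                   # divisor n (assumed)
var2 = sum((b - m2) ** 2 for b in x2) / n
cov12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / n

# S1 = (1/n) sum_j (loo mean of pop. 1 - loo mean of pop. 2)^2
s1_direct = sum((loo_mean(x1, j) - loo_mean(x2, j)) ** 2 for j in range(n)) / n
# S2 = (1/n) sum_j (X_1j - loo mean of pop. 1) * (loo mean of pop. 2 - loo mean of pop. 1)
s2_direct = sum(
    (x1[j] - loo_mean(x1, j)) * (loo_mean(x2, j) - loo_mean(x1, j))
    for j in range(n)
) / n

s1_closed = (m1 - m2) ** 2 + (var1 + var2 - 2 * cov12) / (n - 1) ** 2
s2_closed = n * (var1 - cov12) / (n - 1) ** 2

lambda2_opt = s2_direct / s1_direct  # minimizer of the two-population CV criterion
print(s1_direct, s1_closed, s2_direct, s2_closed, lambda2_opt)
```

The direct and closed-form values agree to rounding error, and the sign of the second quantity is governed by the variance estimate minus the covariance estimate, which is the quantity Proposition 2.1 is concerned with.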

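Proof of Proposition 2.1. By the weak law of large numbers, it
follows that

\[ \hat\sigma_1^2 - \widehat{\mathrm{cov}} \xrightarrow{P} \sigma_1^2 - \rho\sigma_1\sigma_2. \]

Thus the condition $\rho < \sigma_1/\sigma_2$ implies that $\hat\sigma_1^2 > \widehat{\mathrm{cov}}$ for sufficiently large $n$. Thus,
$\lambda_2^{\mathrm{opt}}$ eventually will be positive. □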

Proof of Proposition 2.2. From Lemma 2.1, it follows that the second
term of $S^e_1$ goes to zero in probability as $n$ goes to infinity, while the
first term converges to $(\theta_1^0 - \theta_2^0)^2$ in probability. Therefore we have

\[ S^e_1 \xrightarrow{P_{\theta^0}} (\theta_1^0 - \theta_2^0)^2 \quad \text{as } n \to \infty, \]

where $(\theta_1^0 - \theta_2^0)^2 \ne 0$ by assumption.

Moreover, we see that $S^e_2 = O_P(1/n)$. By definition of $\lambda_2^{\mathrm{opt}}$, it follows that

\[ |\lambda_2^{\mathrm{opt}}| = \frac{|S^e_2|}{|S^e_1|} \xrightarrow{P_{\theta^0}} 0 \quad \text{as } n \to \infty. \]

This completes the proof. □



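Proof of Proposition 2.3. By differentiating $D_e - \gamma(\mathbf{1}^t\lambda - 1)$ and
setting the result to zero, it follows that

\[ \frac{\partial\{D_e - \gamma(\mathbf{1}^t\lambda - 1)\}}{\partial\lambda} = -2b_e + 2A_e\lambda_e^{\mathrm{opt}} - \gamma\mathbf{1} = 0. \]

It then follows that

\[ \lambda_e^{\mathrm{opt}} = A_e^{-1}\left( b_e + \frac{\gamma}{2}\mathbf{1} \right). \]

We then have

\[ 1 = \mathbf{1}^t\lambda_e^{\mathrm{opt}} = \mathbf{1}^t A_e^{-1}\left( b_e + \frac{\gamma}{2}\mathbf{1} \right). \]

Thus

\[ \gamma = \frac{2}{\mathbf{1}^t A_e^{-1}\mathbf{1}}\left( 1 - \mathbf{1}^t A_e^{-1} b_e \right). \]

Therefore

\[ \lambda_e^{\mathrm{opt}} = A_e^{-1}\left( b_e + \frac{1 - \mathbf{1}^t A_e^{-1} b_e}{\mathbf{1}^t A_e^{-1}\mathbf{1}}\,\mathbf{1} \right). \]

Since $D_e$ is a quadratic function of $\lambda$ and $A_e > 0$, the minimum is achieved
at the point $\lambda_e^{\mathrm{opt}}$. Furthermore, by (5) and (7) we have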

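\[ A_e^{-1}b_e = A_e^{-1}(A_1 - e_n^2\hat{\boldsymbol\sigma}_1^2) = (1,0,0,\ldots,0)^t - e_n^2 A_e^{-1}\hat{\boldsymbol\sigma}_1^2. \]
Denote the optimum weight vector by $\lambda^{\mathrm{opt}}$. It follows that

\[ \lambda^{\mathrm{opt}} = (1,0,0,\ldots,0)^t - e_n^2\left( A_e^{-1}\hat{\boldsymbol\sigma}_1^2 - \frac{\mathbf{1}^t A_e^{-1}\hat{\boldsymbol\sigma}_1^2}{\mathbf{1}^t A_e^{-1}\mathbf{1}}\,A_e^{-1}\mathbf{1} \right). \]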

This completes the proof. □ 

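Proof of Proposition 2.4. From (8), it follows that

\[ \lambda_1^{\mathrm{opt}} = 1 - \frac{(n_1/(n_1-1))^2\,\hat\sigma_1^2}{n_1(\bar{X}_{1\cdot} - \bar{X}_{2\cdot})^2 + (1/(n_1-1))\,\hat\sigma_1^2}. \]

By the weak law of large numbers, we have

\[ \hat\sigma_1^2 \xrightarrow{P_{\theta^0}} \sigma_1^2, \qquad
   (\bar{X}_{1\cdot} - \bar{X}_{2\cdot})^2 \xrightarrow{P_{\theta^0}} (\theta_1^0 - \theta_2^0)^2 \ne 0. \]

It then follows that

\[ \frac{(n_1/(n_1-1))^2\,\hat\sigma_1^2}{n_1(\bar{X}_{1\cdot} - \bar{X}_{2\cdot})^2 + (1/(n_1-1))\,\hat\sigma_1^2} \xrightarrow{P_{\theta^0}} 0. \]

We then have

\[ \lambda_1^{\mathrm{opt}} \xrightarrow{P_{\theta^0}} 1. \]

The last assertion of the theorem follows by the fact that $\lambda_1 + \lambda_2 = 1$.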

Proof of Theorem 3.1. Consider 
1 1 ni 

-D ni (\) = -Y / (Xi 3 -Hd{- j) )) 2 

n-\ n-i rr ' 

ni 3=1 

1 ni 1 m 

= - - «K#" i} )) 2 + - E(^i _i) ) - *$~ j) )) 2 

r) "1 



Note that 



where 



1 ni 

= -£(*y - <^?))(^T J) ) - 
711 3=1 

+ -£(*(*?) - tf^Xrt^) - 
ni 3=1 

= S\ + S*2, 



* = - £(*u - - rt^)). 

ni i=i 
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s 2 = - 1>(0?) - m^mm^) - m 1} ))- 

Hl 3=1 

We first show that $S_1 \xrightarrow{P_{\theta^0}} 0$.
Consider 

P e o(\Si\>e) 

= P e o(e< \Si\ and \</>0{~ j) ) - 

+ P o(e< \Si\ and |0(0^) 

<P e Je<\Si\<-Y,\Xi 3 - 

V ni j=t 



< M for all j) 
The first term goes to zero by the weak law of large numbers. The second
term also goes to zero by assumption (iv). We then have 



> h E , 
We next show that $S_2 \xrightarrow{P_{\theta^0}} 0$ as $n_1 \to \infty$.
Consider 
= P g o(e <\S 2 \ and |0(0$ -i) ) - <P{d[~ j) )\ < M for all j) 
+ P o(e < \S 2 \ and - <K#1~°)| > M for some J) 



<P g o \^e< \S 2 \ < 
m 



M 

m 

(-/). 



3=1 



\r l) )\>M) 



1=1 
<Pe° 



M 



:£ < 



"i 



7=1 

+ n 1 P o(|^ 1) )-0(^ 1) )|>M) 
+ n 1 P o(|^" 1 - 1) ) 



P o 



-V 



> M). 

The first term goes to zero by assumption (ii). The second term also goes 
to zero by assumption (iv). We then have 

\[ P_{\theta^0}(|S_2| > \varepsilon) \to 0 \quad \text{as } n_1 \to \infty. \tag{19} \]

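It then follows that

\[ \frac{1}{n_1}D_{n_1}(\lambda) = \frac{1}{n_1}\sum_{j=1}^{n_1} \bigl( X_{1j} - \phi(\hat\theta_1^{(-j)}) \bigr)^2
   + \frac{1}{n_1}\sum_{j=1}^{n_1} \bigl( \phi(\hat\theta_1^{(-j)}) - \phi(\tilde\theta_1^{(-j)}) \bigr)^2 + R_n, \tag{20} \]

where $R_n \xrightarrow{P_{\theta^0}} 0$. Observe that the first term is independent of $\lambda$. Therefore
the second term must be minimized with respect to $\lambda$ to obtain the minimum
of $\frac{1}{n_1}D_{n_1}(\lambda)$. We see that the second term is always nonnegative. It then
follows that, with probability tending to 1,

\[ \frac{1}{n_1}D_{n_1}(\lambda) \ge \frac{1}{n_1}\sum_{j=1}^{n_1} \bigl( X_{1j} - \phi(\hat\theta_1^{(-j)}) \bigr)^2 = \frac{1}{n_1}D_{n_1}(w_0), \]

since $\tilde\theta_1^{(-j)} = \hat\theta_1^{(-j)}$ for $\lambda = w_0 = (1,0,0,\ldots,0)^t$ for fixed $n_1$.

Finally, we will show that

\[ \lambda^{(cv)} \xrightarrow{P_{\theta^0}} w_0 \quad \text{as } n_1 \to \infty. \]

Suppose to the contrary that $\lambda^{(cv)} \xrightarrow{P_{\theta^0}} w_0 + d$ where $d$ is a nonzero vector.
Then there exists $n_0$ such that for $n_1 > n_0$,

\[ \frac{1}{n_1}D_{n_1}(\lambda^{(cv)}) > \frac{1}{n_1}D_{n_1}(w_0). \]

This is a contradiction because $\lambda^{(cv)}$ is the vector which minimizes $\frac{1}{n_1}D_{n_1}$
for any fixed $n_1$ and the minimum of $\frac{1}{n_1}D_{n_1}(\lambda)$ is unique by assumption.

□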
Proof of Lemma 3.1. Recall that the average discrepancy of cross-validation
for the log-normal case is given by

\[ \frac{1}{n}D_n(\lambda_1,\lambda_2) = \frac{1}{n}\sum_{j=1}^{n} \left( X_{1j} - e^{(\lambda_1/(n-1))\sum_{k\ne j}\log(X_{1k}) + (\lambda_2/(n-1))\sum_{k\ne j}\log(X_{2k}) + 1/2} \right)^2. \tag{21} \]

(i) Since we require that $\lambda_1 + \lambda_2 = 1$, we can rewrite the average
discrepancy as

\[ \frac{1}{n}D_n(1-\lambda_2,\lambda_2) = \frac{1}{n}\sum_{j=1}^{n} \left( X_{1j} - e^{\bar{Y}^{(-j)}_{1\cdot} + \lambda_2(\bar{Y}^{(-j)}_{2\cdot} - \bar{Y}^{(-j)}_{1\cdot}) + 1/2} \right)^2, \tag{22} \]

where $Y_{ij} = \log(X_{ij})$, $i = 1,2$; $j = 1,2,\ldots,n$.

Note that $\alpha(x) = (x-a)^2$ and $\beta(x) = e^{bx+c}$ are both convex functions
for any given constants $a$, $b$ and $c$. It then follows that $\gamma(x) = (e^{bx+c} - a)^2$
is also a convex function. Thus, $\frac{1}{n}D_n(1-\lambda_2,\lambda_2)$ is a strictly convex function
with respect to $\lambda_2$ for fixed $n$.
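
For readers who want to experiment with the log-normal cross-validation criterion, the sketch below evaluates it as a function of the second weight and locates its minimizer on a grid over [-1, 1]. The two samples are hypothetical, the unit log-scale variance (hence the 1/2 in the exponent) follows the log-normal setting of this proof, and the grid search is merely a convenient stand-in for a proper one-dimensional optimizer.

```python
import math


def lognormal_cv_discrepancy(lam2, x1, x2):
    """Average squared leave-one-out prediction error for weights (1 - lam2, lam2)."""
    n = len(x1)
    y1 = [math.log(v) for v in x1]
    y2 = [math.log(v) for v in x2]
    t1, t2 = sum(y1), sum(y2)
    total = 0.0
    for j in range(n):
        m1 = (t1 - y1[j]) / (n - 1)  # leave-one-out mean of log X_1
        m2 = (t2 - y2[j]) / (n - 1)  # leave-one-out mean of log X_2
        prediction = math.exp(m1 + lam2 * (m2 - m1) + 0.5)
        total += (x1[j] - prediction) ** 2
    return total / n


def minimize_on_grid(f, lo=-1.0, hi=1.0, steps=2001):
    """Return (argmin, min) of f over an evenly spaced grid on [lo, hi]."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    best = min(grid, key=f)
    return best, f(best)


# Hypothetical positive samples of equal size, for illustration only.
x1 = [1.2, 3.4, 0.8, 2.5, 5.1, 1.9, 2.2, 0.6]
x2 = [2.0, 4.1, 1.5, 3.3, 6.0, 2.8, 1.1, 0.9]


def criterion(lam2):
    return lognormal_cv_discrepancy(lam2, x1, x2)


best_lam2, best_value = minimize_on_grid(criterion)
print(best_lam2, best_value)
```

Plotting `criterion` over the grid is a quick way to inspect the shape of the discrepancy for a particular data set before relying on any convexity argument.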

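(ii) The first-order derivative of $\frac{1}{n}D_n$ is given by

\[
\frac{1}{n}\frac{\partial D_n(1-\lambda_2,\lambda_2)}{\partial\lambda_2}
 = -\frac{2}{n}\sum_{j=1}^{n} \left( X_{1j} - e^{\bar{Y}^{(-j)}_{1\cdot}+\lambda_2(\bar{Y}^{(-j)}_{2\cdot}-\bar{Y}^{(-j)}_{1\cdot})+1/2} \right)
   e^{\bar{Y}^{(-j)}_{1\cdot}+\lambda_2(\bar{Y}^{(-j)}_{2\cdot}-\bar{Y}^{(-j)}_{1\cdot})+1/2}\,(\bar{Y}^{(-j)}_{2\cdot}-\bar{Y}^{(-j)}_{1\cdot}). \tag{23}
\]

Observe that

\[ \bar{Y}^{(-j)}_{2\cdot} - \bar{Y}^{(-j)}_{1\cdot} = (\bar{Y}_{2\cdot} - \bar{Y}_{1\cdot}) + \frac{1}{n-1}\bigl( [\bar{Y}_{2\cdot} - \bar{Y}_{1\cdot}] - [Y_{2j} - Y_{1j}] \bigr). \]

It then follows that

\[
\frac{1}{n}\frac{\partial D_n(1-\lambda_2,\lambda_2)}{\partial\lambda_2}
 = -\frac{2}{n}\sum_{j=1}^{n} \left( X_{1j} - e^{\bar{Y}_{1\cdot}+\lambda_2(\bar{Y}_{2\cdot}-\bar{Y}_{1\cdot})+T^n_j+1/2} \right)
   e^{\bar{Y}_{1\cdot}+\lambda_2(\bar{Y}_{2\cdot}-\bar{Y}_{1\cdot})+T^n_j+1/2}\,\bigl( (\bar{Y}_{2\cdot}-\bar{Y}_{1\cdot}) + R^n_j \bigr), \tag{24}
\]

where $R^n_j = \frac{1}{n-1}\bigl( [\bar{Y}_{2\cdot} - \bar{Y}_{1\cdot}] - [Y_{2j} - Y_{1j}] \bigr)$ and

\[ T^n_j(\lambda_2) = \lambda_2 R^n_j + \frac{1}{n-1}(\bar{Y}_{1\cdot} - Y_{1j}). \tag{25} \]

Thus

\[ \frac{1}{n}\frac{\partial D_n(1-\lambda_2,\lambda_2)}{\partial\lambda_2} = -2F_n(\lambda_2)\,E_n(\lambda_2), \tag{26} \]

where

\[ F_n(\lambda_2) = e^{\bar{Y}_{1\cdot}+\lambda_2(\bar{Y}_{2\cdot}-\bar{Y}_{1\cdot})+1/2}\,(\bar{Y}_{2\cdot}-\bar{Y}_{1\cdot}) \tag{27} \]

and

\[ E_n(\lambda_2) = \frac{1}{n}\sum_{j=1}^{n} \left( X_{1j} - e^{\bar{Y}_{1\cdot}+\lambda_2(\bar{Y}_{2\cdot}-\bar{Y}_{1\cdot})+T^n_j+1/2} \right)
   \bigl( 1 + R^n_j/(\bar{Y}_{2\cdot}-\bar{Y}_{1\cdot}) \bigr)\, e^{T^n_j}. \tag{28} \]

For any $|\lambda_2| \le 1$, we have $T^n_j(\lambda_2) = O_P(n^{-1})$ and $e^{T^n_j} = 1 + T^n_j + O_P(n^{-2})$.
Thus

\[ E_n(\lambda_2) = \frac{1}{n}\sum_{j=1}^{n} \left( X_{1j} - e^{\bar{Y}_{1\cdot}+\lambda_2(\bar{Y}_{2\cdot}-\bar{Y}_{1\cdot})+T^n_j+1/2} \right)
   \bigl( 1 + R^n_j/(\bar{Y}_{2\cdot}-\bar{Y}_{1\cdot}) \bigr)\bigl( 1 + T^n_j + O_P(n^{-2}) \bigr), \qquad |\lambda_2| \le 1. \tag{29} \]

Furthermore, for any $|\lambda_2| \le 1$ we have

\[ E_n(\lambda_2) = \frac{1}{n}\sum_{j=1}^{n} \left( X_{1j} - e^{\bar{Y}_{1\cdot}+\lambda_2(\bar{Y}_{2\cdot}-\bar{Y}_{1\cdot})+T^n_j+1/2} \right)
   + U_n(\lambda_2)/(\bar{Y}_{2\cdot}-\bar{Y}_{1\cdot}) + V_n(\lambda_2) + W_n(\lambda_2)/(\bar{Y}_{2\cdot}-\bar{Y}_{1\cdot}), \tag{30} \]

where

\[ U_n(\lambda_2) = \frac{1}{n}\sum_{j=1}^{n} \left( X_{1j} - e^{\bar{Y}_{1\cdot}+\lambda_2(\bar{Y}_{2\cdot}-\bar{Y}_{1\cdot})+T^n_j+1/2} \right) R^n_j, \]

\[ V_n(\lambda_2) = \frac{1}{n}\sum_{j=1}^{n} \left( X_{1j} - e^{\bar{Y}_{1\cdot}+\lambda_2(\bar{Y}_{2\cdot}-\bar{Y}_{1\cdot})+T^n_j+1/2} \right)\bigl( T^n_j + O_P(n^{-2}) \bigr), \]

\[ W_n(\lambda_2) = \frac{1}{n}\sum_{j=1}^{n} \left( X_{1j} - e^{\bar{Y}_{1\cdot}+\lambda_2(\bar{Y}_{2\cdot}-\bar{Y}_{1\cdot})+T^n_j+1/2} \right)\bigl( T^n_j + O_P(n^{-2}) \bigr) R^n_j. \]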

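If $|\lambda_2| \le 1$, then

\[ \bigl| \bar{Y}^{(-j)}_{1\cdot} + \lambda_2(\bar{Y}^{(-j)}_{2\cdot} - \bar{Y}^{(-j)}_{1\cdot}) \bigr| \le \bigl| \bar{Y}^{(-j)}_{1\cdot} \bigr| + \bigl| \bar{Y}^{(-j)}_{2\cdot} - \bar{Y}^{(-j)}_{1\cdot} \bigr|. \tag{31} \]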

We also consider 
1 n 

^n(A 2 ) = -E^2) 
3=1 

i n / i r 1 i \ 
Note that, for any sequence of random variables $Z_j$, $j = 1,2,\ldots,n$, with
$E|Z_iZ_k| < \infty$, $i,k = 1,2,\ldots,n$,



(32) 



P 



1 



n(n — 1) 

\ I 7 = 1 



1 



i k 



(33) 



By combining (25), (31) and (32), we can show that, for $|\lambda_2| \le 1$,

\[ U_n(\lambda_2) = O_P(n^{-2}), \quad V_n(\lambda_2) = O_P(n^{-2}), \quad
   W_n(\lambda_2) = O_P(n^{-2}), \quad B_n(\lambda_2) = O_P(n^{-2}). \]

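We also observe that

\[
\frac{1}{n}\sum_{j=1}^{n} \left( X_{1j} - e^{\bar{Y}_{1\cdot}+\lambda_2(\bar{Y}_{2\cdot}-\bar{Y}_{1\cdot})+T^n_j+1/2} \right)
 = \frac{1}{n}\sum_{j=1}^{n} \left( X_{1j} - e^{\bar{Y}_{1\cdot}+\lambda_2(\bar{Y}_{2\cdot}-\bar{Y}_{1\cdot})+1/2}\bigl( 1 + T^n_j(\lambda_2) + O_P(n^{-2}) \bigr) \right).
\]

It then follows that

\[ E_n(\lambda_2) = \frac{1}{n}\sum_{j=1}^{n} \left( X_{1j} - e^{\bar{Y}_{1\cdot}+\lambda_2(\bar{Y}_{2\cdot}-\bar{Y}_{1\cdot})+1/2} \right) + C_n(\lambda_2), \tag{34} \]

where $C_n = B_n(\lambda_2) + U_n(\lambda_2)/(\bar{Y}_{2\cdot} - \bar{Y}_{1\cdot}) + V_n(\lambda_2) + W_n(\lambda_2)/(\bar{Y}_{2\cdot} - \bar{Y}_{1\cdot})$.
It is clear that $\bar{Y}_{2\cdot} - \bar{Y}_{1\cdot} \xrightarrow{P} \mu_2^0 - \mu_1^0$. Thus

\[ E_n(\lambda_2) = \frac{1}{n}\sum_{j=1}^{n} \left( X_{1j} - e^{\bar{Y}_{1\cdot}+\lambda_2(\bar{Y}_{2\cdot}-\bar{Y}_{1\cdot})+1/2} \right) + O_P(n^{-2}). \tag{35} \]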



Without essential loss of generality, we assume that /J( > /i 2 . It then follows 
that 

(36) E n (l) ^ e^ +1/2 (l - e^-"?) < 
and 

(37) £„(-!) ^ e^ +1/2 (l - etf-fi) > 0. 

By (26), (27), (36) and (37), it follows that for sufficiently large n, 

dD n (l-X 2 ,\ 2 ) 



(38) 



4 aD n (l-A 2 ,A 2 ) 

— x * — 



d\ 2 



A 2 =l 



d\ 2 



A 2 =-l 



= F n (l) * F n (-1) * E n (l) * E(-l) < 0. 

Since $D_n$ is strictly convex, its second-order derivative is positive.
Therefore, the first-order derivative of $D_n$ is monotone. By (38), we then
have that the optimal weight $\lambda_2^* \in (-1,1)$ for sufficiently large $n$ with
probability tending to 1. Furthermore, it converges to a unique limit. Suppose
that this is not true and there are two limits $\lambda_2'$ and $\lambda_2''$. Then $0.5\lambda_2' + 0.5\lambda_2''$
achieves a smaller value for $\frac{1}{n}D_n(1-\lambda_2,\lambda_2)$ since it is strictly convex. This is
a contradiction. □

Proof of Corollary 3.2. It suffices to show the assumptions of Theorem 3.1
are satisfied for the log-normal case.

(i) By Lemma 3.1, $\frac{1}{n}D_n(1-\lambda_2,\lambda_2)$ is a strictly convex function with
respect to $\lambda_2$. Therefore assumption (i) of Theorem 3.1 is satisfied.

(ii) We then check assumption (ii) of Theorem 3.1. Let = ^ X Ej=i(<K/n 
Thus 

1 1 n ^ 

± S I = lr/ e (V(«-l))E w Vu)+l/2 _ e M?+lM 

n n 

Let A" = e (V(n-l))EW°S(*ifc)+l/2 _ e M?+l/2. It then foUows that 

I I n 

n Sn ~ n 

Observe that $Y_{ij} = \log(X_{ij}) \sim N(\mu_i^0, 1)$, $j = 1,2,\ldots,n$. Thus we have

\[ E_{\mu_i^0}\bigl( e^{t\log(X_{ij})} \bigr) = E_{\mu_i^0}\bigl( e^{tY_{ij}} \bigr) = e^{\mu_i^0 t + t^2/2}, \tag{39} \]

for $i = 1,2$; $j = 1,2,\ldots,n$. We then have

\[ E_{\mu_1^0}\, e^{(1/(n-1))\sum_{k=2}^{n}\log(X_{1k})} = \left( e^{(1/(n-1))\mu_1^0 + 1/(2(n-1)^2)} \right)^{n-1} = e^{\mu_1^0 + 1/(2(n-1))}. \tag{40} \]

We also have 

£ oe (l/(n-l))Y:^ =2 log(X lfe ) te (l/(n-l))Er =1 l<«(^) 

= £ ( e (V(«-l))log(^ll)+(l/(n-l))log(X ln )^_ E ^(2/(n-2))y]^: 2 1 log(X lfc )) 
= e 2*(l/(n-l)K'+l/(2(n-l) 2 ) # e („-2)*(2/(n-l)) M ?+ 2/(n-l) 2 ) 
= e 2^ + ((2n-4)/(n-l) 2 )_ 

(41) 

By (39) and (40), it then follows that 
-£Ao (A™) 2 = e * B.oIeWM)^^) _ ^2 



e*[^o(e( 1 /("- 1 ))SL 2 M^ fc ) ) 2 

-2e#*E ( e (V(«-i))EL 2 ^(^«)) + e ^l 
e * [^(e^- 1 ))^ 1 ^!,)) _ 2e M? * e M?+l/(2(n-l)) + e 2 Mi] 
* [(g^efyfn-i))^^))"- 1 _ 2e^? * e M?+l/(2(n-l)) + e 2^ 



e * [ e 2M?+2/(n-l) _ 2eM ; * e ^+l/(2(n-l)) + e 2 M °] 



[by (15)] 



By (40) and (41), we also have 

Efi(A? * A n n ) = ^( e W("-i))EL 8 M%)+i/2 _ e M?+l/2) 
* ( e (V(«-i))Er=i llo g( x ife) +1 /2 _ e ^?+i/2) 
= eiE^el 1 /^ 1 ^:^ 1 ^) * ^/(n-iD^^iog^) 
- 2e^ * E, Mn-i))Yr k ^{x lk ) + e 2 M ? } 

Pi 7 

= e * ( e 2^? + ((2n-4)/(n~l) 2 ) _ 2 # ^ # e/i °+l/(2(n-l)) + ^ ) 
= e 2 A1 0+l^((2n-4)/(n-l) 2 ) _ 2 ^ e l/(2(n-l)) + ^ 

= e 2M ( 1 > + l fI 



■;?■ 



For any fixed j and fc, we then have 



(42) 



Therefore, 



E^(A]*Al)=0(- 



n 



>e < 



I\2 



n 2 e 2 



E(S' n ) 



n 



k~2 E (£( V) + ^2 f E E ^? * ^ 



i 

ne 



t-,i Jfl N2 n ( n ~ 1) 
2 £(,4?) 2 + — - 



oil 



n 



n 2 £ 2 Mi 

as n — > oo. 



This implies that assumption (ii) is satisfied for the log-normal case.

(iii) Let



i i n 

3=1 
Observe that



i i n i n i n 

n re f— f J re r— : n r— J 



JT + JJ + IT 



3 ! 



where If = I £? =1 ^ = £ £?=i X^™) ^d = ± £? =1 ^(Ai 
By the weak law of large numbers, it follows that 

(43) A™ = - E *ij ^(^n) 2 = ^ 1+2 



n . 



Consider 



-i n i n _, 

(44) I- = ^J2 X yM~ j) ) = lj]eW M Sw yu+1/S 



re r-r re . 

where Yy = log(Xy) ~ iV^?, 1), j = 1,2, . . . ,n. 
Note that for any j 

n-l~ n - 1 — 1 

where Fi. = ^£fc=i*ijfc- 
It then follows that 

(45) = e 1/2 * eWC"- 1 ))^- * ( - E e ((^-2)/(n(n-i)))*y lfe 



Note that 



I e ((n-2)/(n(n-l)))*y lfc 



(46) 1 » / „-2 \ ",° 
It then follows that 

F o 

(47) I 2 n -A e^ +1/2 as n oo. 
We also have 

_ I " P..o 

v, J- 



I n = e * f&n/in-lVYi. _|_ y> e -(2/(n-2))y, ^ + 1 

3 n f-{ 

It then follows that 

(48) -S" e 2 ^+ 2 - 2 * e^ +1 /2 + e 2„°+i_ 

n 

This implies that assumption (iii) is satisfied.

(iv) We are now in a position to verify assumption (iv) of Theorem 3.1.
Note that the optimum weight $\lambda_2^*$ is chosen such that

dD n (l-X 2 ,\ 2 ) 



0. 



A 2 =A^ 



By (27), we see that either $F_n > 0$ or $F_n < 0$ for sufficiently large $n$ if $\mu_1^0 \ne \mu_2^0$.
By (26), (35) and Lemma 3.1, it follows that the optimum weight $\lambda_2^*(n)$
satisfies

1 n — — — 

(49) = E n (\* 2 ) = ~ - eY^iYi-Y,)^ + 0p(n -2y 



We then have 
(50) 0(/2?) 



e y, + A*(y,-y,) + i/ 2 = I ^ x±j + 0p{pr *y 



For sufficiently large n and any constant M > 0, say 1, and a certain 
C(M), which depends on M and whose value is of no relevance to the 
argument, we have 



p,j>(\m)-m)\>M) 



e l/nE" =1 lo g (X lj)+ l/2 _ 1 £ ^ . + Gp(n - 2) 



n f , 



> Af 



< P. 



^/n^U^^ 1 ' 2 - e M?+l/2| > M/2 ) 



n ~ 



3 A.°+l/2 > M / 2 j + 0(„-2) 

>C(M) J 



n T~t 

3=1 



> M/2 + 0(n" 



= 0(n" 2 ). 

The last inequality follows since the fourth moments of $X_{1j}$ and $\log(X_{1j})$
both exist for any fixed $j$. Therefore, the last assumption of Theorem 3.1 is
satisfied for the log-normal case. This completes the proof. □

The proofs of Theorems 3.2-3.4 resemble the proofs for fixed weights as 
given by Wang, van Eeden and Zidek (2004). These theorems can be proved 
by using Theorem 3.1 and replacing fixed weights with adaptive weights in 
weighted likelihood estimation. Details can be found in Wang (2001). 